Inferring gender from names

On this page, we introduce methods for gender inference that are solely based on people's first and last names. The majority of them rely on large name datasets such as SSA or IPUMS. We cover here a couple of corresponding R and Python packages.

Gender ‘R’ SSA

Data

The US Social Security Administration (SSA) dataset covers registered baby names in the United States since 1880. The data is released as a series of text files, one for each year. Each line contains a name, a gender, and the number of new Social Security cards issued to babies of that name and gender in the given year.
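To make the record layout concrete, here is a minimal Python sketch that parses per-year text in the "name, gender, count" form described above and computes the share of female records for a name over a range of years. The comma separator, the M/F codes, the sample counts, and the helper names load_counts and proportion_female are assumptions made for illustration; this is not how the R package is implemented.

```python
from collections import defaultdict

# Hypothetical sample in the SSA layout: one text block per year, one
# "name,sex,count" record per line. The counts are invented for illustration.
SAMPLE_FILES = {
    1900: "Leslie,M,920\nLeslie,F,80\nMary,F,16000\n",
    2000: "Leslie,F,960\nLeslie,M,40\nMary,F,6000\n",
}


def load_counts(files_by_year):
    """Return {(year, name, sex): count} parsed from per-year text blocks."""
    counts = defaultdict(int)
    for year, text in files_by_year.items():
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, sex, count = line.split(",")
            counts[(year, name.lower(), sex)] += int(count)
    return counts


def proportion_female(counts, name, first_year, last_year):
    """Share of records for `name` tagged F within the year range, or None."""
    female = male = 0
    for (year, rec_name, sex), n in counts.items():
        if rec_name == name.lower() and first_year <= year <= last_year:
            if sex == "F":
                female += n
            elif sex == "M":
                male += n
    total = female + male
    return female / total if total else None


counts = load_counts(SAMPLE_FILES)
print(proportion_female(counts, "Leslie", 1900, 1900))  # 0.08 with the sample above
print(proportion_female(counts, "Leslie", 2000, 2000))  # 0.96 with the sample above
```

With real SSA files, the dictionary would be filled by reading each yearly file from disk; the year comes from the file, not from the record itself.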

How to use

Many gender detection tools, such as the "gender" package in R or OpenGenderTracking, rely on this database. In addition to the SSA data, the R package offers gender inference based on data from the U.S. Census Bureau (IPUMS USA) and the North Atlantic Population Project. The gender() function takes a character vector of names, a year or range of years, and the name of the desired dataset as input, and predicts the corresponding genders. For example, to use the SSA data, set the parameter "method" to "ssa". Below you can see an example of gender predictions for the names "Madison" and "Hillary".

Figure: Example of the R "gender" package with SSA data

You can try out this method with executable Python notebooks powered by Binder. Click the button below to launch it and follow the instructions:

Binder is a service that deploys computing environments on the cloud.

Keep in mind that these notebooks are made for demonstration of the method and are not designed for large datasets.

Limitations

The R "gender" package has several limitations. Before you use it, take the following points into account:

The link between gender and naming practices, like language itself, is not static. Hence, the same name can be associated with different genders in different time periods. For example, in 1900, 92% of the babies born in the United States named Leslie were classified as male. In 2000, about 96% of the Leslies born in that year were classified as female. Therefore, you need to specify a timeframe when using this approach.
As it is a U.S. name database, the performance is likely to deteriorate if you use this method for gender inference in other regions.

Gender ‘R’ IPUMS

Data

The Integrated Public Use Microdata Series (IPUMS USA) census data consists of samples of the American population drawn from fifteen federal censuses and from the American Community Surveys between 1790 and 2010.

How to use

This database is also included in the "gender" package in R and in other web-based name extraction packages (see here). To use the R "gender" package with IPUMS, set the parameter "method" to "ipums":

Figure: Example of the R "gender" package with IPUMS data

Limitations

Please refer to the corresponding subsection above for the limitations of this method.

Gender-Guesser

Data

Gender-Guesser is a Python package that uses a list of 40,000 names primarily collected by Jörg Michael. The data has been manually collected and independently classified by several native speakers. The advantage of this name list is that it provides detailed information about how popular a first name is in a country and how strongly it is associated with a given gender. Therefore, it enables the disambiguation of names based on the country of origin. The list also provides information for a variety of countries, including China and India.

How to use

You can install the library with pip or conda and call its "get_gender" function. Given a name, gender-guesser suggests whether the name is male, mostly male, female, mostly female, or unclear, based on the approximate frequency of each name per country. It also supports I18N for localization of names, and you can specify a country parameter for better predictions.
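As a rough illustration of the call described above, the following Python sketch wraps "get_gender" and collapses its fine-grained labels into three coarse classes. The label strings ("mostly_male", "mostly_female", and so on) and the "great_britain" country name follow my reading of the package README and should be checked against the installed version; the helpers make_detector and coarse_gender are hypothetical names introduced here.

```python
# Fine-grained labels collapsed into coarse classes. Any other label
# (for example an androgynous or not-found result) counts as "unclear".
COARSE = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
}


def make_detector():
    """Create a gender-guesser detector (needs `pip install gender-guesser`)."""
    import gender_guesser.detector as gender  # imported lazily on purpose

    return gender.Detector()


def coarse_gender(detector, name, country=None):
    """Return 'male', 'female', or 'unclear' for a first name."""
    if country is None:
        label = detector.get_gender(name)
    else:
        # Passing a country lets the detector use country-specific frequencies.
        label = detector.get_gender(name, country)
    return COARSE.get(label, "unclear")
```

A typical session would create the detector once with make_detector() and then call coarse_gender(detector, "Jamie", "great_britain") for each name. Whether to merge "mostly" labels into the main classes depends on how much uncertainty your analysis can tolerate.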

Several other libraries operate on the same list, for example gender.c in C or Python's Sexmachine library. However, most of them are not compatible with Python 3.

You can try out this method with executable Python notebooks powered by Binder. Click the button below to launch it and follow the instructions:

Limitations

As mentioned above, the results are more accurate if you specify a country of origin for each input. The gender assignment of this collection is presumed to be of high quality, with manual checks by native speakers of various countries. However, the dictionary of names was published a decade ago and has not been updated since, which limits the usefulness of both the package and its underlying data source.

Genderize (Commercial tool)

Note: The "Performance" section does not cover the following tools due to their constant changes.

Data

Genderize, one of the most extensively used tools in research, utilizes user profiles from major social networks across 79 countries and 89 languages. In April 2015, it contained 212,252 unique terms gathered from about 2 million social network profiles (Wais, 2016).

How it works

The API is free for up to 1000 names per day, and no registration or API key is required within that limit. Each response contains the predicted gender, a confidence value, and the number of data rows examined to reach the prediction. To predict more than 1000 names per day, you need to obtain an API key from store.genderize and append it to every request. Wrapper libraries are available in several languages, including Ruby, Python, JavaScript, and R. See more information on them here.

You can also call the API directly. However, a batch request (multiple names in one request) is limited to 10 names at a time. The API also accepts an optional "country_id" parameter for cases where you know the country of origin.

Figure: Example of a Genderize API response
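The batching rule and the response fields described above can be sketched in Python as follows. The endpoint URL, the "name[]" and "apikey" parameter names, and the "probability" and "count" field names are assumptions based on my understanding of the public Genderize documentation and are not stated on this page; the sample response is made up. The sketch only builds URLs and parses text, so it sends no requests.

```python
import json
from urllib.parse import urlencode

API_URL = "https://api.genderize.io"  # assumed endpoint, not stated in the text
BATCH_SIZE = 10  # batch limit of 10 names per request, as described above


def build_batch_urls(names, country_id=None, api_key=None):
    """Return one request URL per batch of at most BATCH_SIZE names."""
    urls = []
    for start in range(0, len(names), BATCH_SIZE):
        params = [("name[]", n) for n in names[start:start + BATCH_SIZE]]
        if country_id:
            params.append(("country_id", country_id))
        if api_key:
            params.append(("apikey", api_key))  # assumed parameter name
        urls.append(API_URL + "?" + urlencode(params))
    return urls


def parse_batch(body):
    """Map each name in a JSON batch response to (gender, probability, count)."""
    return {
        row["name"]: (row.get("gender"), row.get("probability"), row.get("count"))
        for row in json.loads(body)
    }


# Made-up response body in the assumed shape; the values are not real API output.
SAMPLE_BODY = '[{"name": "peter", "gender": "male", "probability": 0.99, "count": 1000}]'

urls = build_batch_urls(["name%d" % i for i in range(25)], country_id="US")
print(len(urls))  # 3 batches: 10 + 10 + 5 names
print(parse_batch(SAMPLE_BODY))
```

Each URL could then be fetched with any HTTP client, keeping the free daily quota in mind: 25 names count as 25 lookups even though they are sent in three requests.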

Limitations

The daily limit significantly restricts the use of this method. The main criticism, however, concerns data reliability: the providers reveal neither the social networks they draw on nor the exact number of profiles. There is no guarantee that each profile contained a valid first name and gender at the time of collection, and because users can enter an arbitrary character string in a name field, the database could contain erroneous data. Hence, the reliability of the data is difficult to estimate.