Overall performance
This page benchmarks different tools for gender inference. It evaluates various methods (inferring gender from images, names, or both) on open-source datasets (more information on the datasets here). You can get an overview of each method's overall performance and of how it performs when the data is split by gender or by image attributes (e.g., facial accessories, poses, or multiple faces in one image). We evaluated each method on four performance metrics, defined at the bottom of this page: Coverage, Precision, Recall, and F1-score.
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.76 | 0.76 | 0.76 |
| M3 (images only) | WIKI | 1 | 0.89 | 0.89 | 0.88 |
| M3 (images only) | OUI | 1 | 0.69 | 0.69 | 0.69 |
| M3 (images only) | | 1 | 0.87 | 0.84 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.95 | 0.94 | 0.94 |
| M3 (images only) | Scholar | 1 | 0.95 | 0.95 | 0.95 |
| Deepface | IMDB | 0.57 | 0.78 | 0.78 | 0.78 |
| Deepface | WIKI | 0.48 | 0.92 | 0.91 | 0.90 |
| Deepface | OUI | 0.57 | 0.66 | 0.66 | 0.63 |
| Deepface | | 0.45 | 0.85 | 0.83 | 0.83 |
| Deepface | Gender Shades | 0.85 | 0.84 | 0.78 | 0.76 |
| Deepface | Scholar | 0.85 | 0.88 | 0.85 | 0.84 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.77 | 0.84 | 0.81 | 1 | 0.74 | 0.65 | 0.69 |
| M3 (images only) | WIKI | 1 | 0.90 | 0.98 | 0.94 | 1 | 0.87 | 0.55 | 0.68 |
| M3 (images only) | OUI | 1 | 0.67 | 0.63 | 0.65 | 1 | 0.71 | 0.74 | 0.72 |
| M3 (images only) | | 1 | 0.70 | 0.95 | 0.81 | 1 | 0.96 | 0.75 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.91 | 0.99 | 0.95 | 1 | 0.99 | 0.87 | 0.93 |
| M3 (images only) | Scholar | 1 | 0.93 | 0.97 | 0.95 | 1 | 0.97 | 0.92 | 0.95 |
| Deepface | IMDB | 0.53 | 0.78 | 0.84 | 0.81 | 0.62 | 0.79 | 0.71 | 0.75 |
| Deepface | WIKI | 0.45 | 0.91 | 0.98 | 0.94 | 0.63 | 0.94 | 0.72 | 0.81 |
| Deepface | OUI | 0.83 | 0.58 | 0.94 | 0.71 | 0.87 | 0.83 | 0.55 | 0.66 |
| Deepface | | 0.47 | 0.76 | 0.99 | 0.86 | 0.38 | 0.98 | 0.70 | 0.81 |
| Deepface | Gender Shades | 0.82 | 0.71 | 1 | 0.83 | 0.89 | 1 | 0.52 | 0.69 |
| Deepface | Scholar | 0.87 | 0.78 | 1 | 0.87 | 0.82 | 1 | 0.67 | 0.80 |
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.874 | 0.965 | 0.965 | 0.965 |
| "Gender" R (SSA) | WIKI | 0.823 | 0.957 | 0.954 | 0.955 |
| "Gender" R (SSA) | | 0.434 | 0.943 | 0.943 | 0.942 |
| "Gender" R (SSA) | Scholar | 0.797 | 0.968 | 0.968 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.851 | 0.909 | 0.907 | 0.905 |
| "Gender" R (IPUMS) | WIKI | 0.812 | 0.929 | 0.928 | 0.928 |
| "Gender" R (IPUMS) | | 0.46 | 0.846 | 0.843 | 0.842 |
| "Gender" R (IPUMS) | Scholar | 0.769 | 0.917 | 0.916 | 0.916 |
| Gender_Guesser | IMDB | 0.89 | 0.965 | 0.965 | 0.965 |
| Gender_Guesser | WIKI | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | Scholar | 0.768 | 0.978 | 0.978 | 0.978 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.834 | 0.95 | 0.961 | 0.955 | 0.813 | 0.975 | 0.967 | 0.971 |
| "Gender" R (SSA) | WIKI | 0.784 | 0.847 | 0.955 | 0.898 | 0.814 | 0.988 | 0.953 | 0.97 |
| "Gender" R (SSA) | | 0.431 | 0.90 | 0.943 | 0.921 | 0.437 | 0.945 | 0.903 | 0.923 |
| "Gender" R (SSA) | Scholar | 0.78 | 0.97 | 0.965 | 0.967 | 0.814 | 0.965 | 0.97 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.904 | 0.935 | 0.812 | 0.869 | 0.656 | 0.893 | 0.965 | 0.928 |
| "Gender" R (IPUMS) | WIKI | 0.803 | 0.815 | 0.839 | 0.827 | 0.706 | 0.958 | 0.951 | 0.954 |
| "Gender" R (IPUMS) | | 0.508 | 0.875 | 0.794 | 0.833 | 0.412 | 0.816 | 0.89 | 0.852 |
| "Gender" R (IPUMS) | Scholar | 0.807 | 0.942 | 0.882 | 0.911 | 0.728 | 0.894 | 0.948 | 0.92 |
| Gender_Guesser | IMDB | 0.904 | 0.968 | 0.976 | 0.972 | 0.869 | 0.96 | 0.948 | 0.954 |
| Gender_Guesser | WIKI | 0.872 | 0.988 | 0.971 | 0.979 | 0.845 | 0.898 | 0.954 | 0.925 |
| Gender_Guesser | | 0.497 | 0.945 | 0.968 | 0.956 | 0.412 | 0.964 | 0.939 | 0.951 |
| Gender_Guesser | Scholar | 0.753 | 0.969 | 0.988 | 0.978 | 0.785 | 0.987 | 0.967 | 0.977 |
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.94 | 0.94 | 0.94 |
| M3 (images + names) | WIKI | 1 | 0.96 | 0.96 | 0.96 |
| M3 (images + names) | | 1 | 0.91 | 0.9 | 0.9 |
| M3 (images + names) | Scholar | 1 | 0.96 | 0.96 | 0.96 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.942 | 0.952 | 0.947 | 1 | 0.93 | 0.915 | 0.922 |
| M3 (images + names) | WIKI | 1 | 0.975 | 0.972 | 0.973 | 1 | 0.89 | 0.903 | 0.896 |
| M3 (images + names) | | 1 | 0.83 | 0.952 | 0.887 | 1 | 0.947 | 0.89 | 0.918 |
| M3 (images + names) | Scholar | 1 | 0.937 | 0.991 | 0.963 | 1 | 0.989 | 0.928 | 0.958 |
Subgroup performance for images
In this section, we created five subgroups from the OUI dataset. The goal of these subgroups is to test the image-based methods on different image features and to see how performance changes from one subcategory to another. As the table below shows, performance is also split by subcategory and by gender.
Image-based methods
The columns group by Ethnicity (African, Asian, European), Image quality (Blurry, High quality), Facial accessories (Covered head, Glasses), Multiple faces, and Poses (Side, Front).

| Model | Metric | African | Asian | European | Blurry | High quality | Covered head | Glasses | Multiple faces | Side | Front |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| M3 (images only) | F1 Avg | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.86 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 Male | 0.94 | 0.95 | 0.99 | 0.96 | 0.98 | 0.87 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 Female | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.84 | 0.94 | 0.97 | 0.94 | 0.99 |
| Deepface | Coverage | 0.62 | 0.83 | 0.87 | 0.86 | 0.73 | 0.55 | 0.73 | 0.88 | 0.62 | 0.92 |
| Deepface | F1 Avg | 0.56 | 0.7 | 0.84 | 0.77 | 0.82 | 0.59 | 0.54 | 0.87 | 0.88 | 0.84 |
| Deepface | F1 Male | 0.74 | 0.78 | 0.86 | 0.82 | 0.84 | 0.72 | 0.73 | 0.89 | 0.91 | 0.87 |
| Deepface | F1 Female | 0.34 | 0.63 | 0.82 | 0.72 | 0.8 | 0.47 | 0.33 | 0.85 | 0.86 | 0.82 |
Subgroups labeling
To benchmark the gender classification models on different subgroups, we took the OUI dataset as our base data because it is close to real-world conditions. We created five distinct groups: Multiple faces, Skin tone, Facial accessories, Image quality, and Poses.
The table below depicts the distribution of male and female images for each of these groups and their subcategories.
| Category | Subcategory | Female images | Male images | Total images |
|---|---|---|---|---|
| Image quality | Blurry | 50 | 50 | 200 |
| Image quality | High quality | 50 | 50 | |
| Skin tone | Black | 50 | 50 | 300 |
| Skin tone | Caucasian | 50 | 50 | |
| Skin tone | Asian | 50 | 50 | |
| Multiple faces | | 50 | 50 | 100 |
| Facial accessories | Glasses | 50 | 50 | 200 |
| Facial accessories | Covered head | 50 | 50 | |
| Poses | Front | 50 | 50 | 200 |
| Poses | Side | 50 | 50 | |
As the summary table shows, each category consists of several subcategories, and each subcategory comprises 100 labeled images (50 labeled Male and 50 labeled Female). The images were chosen randomly from the original dataset according to the following conditions:
- For the Multiple faces group, we chose images with more than one face. Each image was labeled according to the person whose full face appears at the front or in the middle of the image.
- For the Facial accessories group, we selected only single-face images with no more than one accessory. An image falls in the Glasses subgroup if the person wears glasses but no headgear, and in the Covered head subgroup if the person wears headgear but no glasses.
- For the Image quality category, we selected only single-face images. The Blurry subcategory contains images whose quality ranges from acceptable to very poor (unclear images); the High-quality subcategory contains images of perfect quality.
- For the Poses category, the Side subgroup contains images of a person captured from the side (side pose), and the Front subgroup contains images of a person captured from the front (front pose).
- The Skin tone category contains images of people classified by their skin tone (Black, Asian, and Caucasian); here, too, we selected only single-face images.
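The balanced 50/50 sampling used for each subcategory can be sketched as follows. This is an illustrative sketch, not the actual labeling pipeline; the `images` records and their `gender` field are hypothetical.

```python
import random

def sample_balanced_subgroup(images, n_per_gender=50, seed=0):
    """Randomly draw n_per_gender images per gender label, mirroring the
    50 Male / 50 Female split used for each subcategory.

    `images` is assumed to be a list of records with a "gender" field.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    males = [im for im in images if im["gender"] == "Male"]
    females = [im for im in images if im["gender"] == "Female"]
    return rng.sample(males, n_per_gender) + rng.sample(females, n_per_gender)
```

In practice, each subcategory's candidate pool would first be filtered by the conditions above (e.g., single-face images only) before sampling.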
Evaluation metrics
In all the performance tables presented on our website, we used the confusion matrix of each class (Male and Female). From it, we computed the following evaluation metrics: Coverage, Precision, Recall, and F1-score.

Confusion matrix for the "Male" class:

| | True gender = Male | True gender = Female |
|---|---|---|
| Predicted gender = Male | True gender = Male & Pred gender = Male | True gender = Female & Pred gender = Male |
| Predicted gender = Female | True gender = Male & Pred gender = Female | True gender = Female & Pred gender = Female |

$$Precision_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

$$Recall_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$

Confusion matrix for the "Female" class:

| | True gender = Female | True gender = Male |
|---|---|---|
| Predicted gender = Female | True gender = Female & Pred gender = Female | True gender = Male & Pred gender = Female |
| Predicted gender = Male | True gender = Female & Pred gender = Male | True gender = Male & Pred gender = Male |

$$Precision_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$

$$Recall_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

The F1-score for each class combines its precision and recall:

$$F1 = {2 \times Precision \times Recall \over Precision + Recall}$$

The Coverage metric represents the proportion of a dataset for which the given method returns a prediction.
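As an illustrative sketch (not the code used to produce the tables above), the per-class metrics and coverage can be computed from paired label lists, where `None` marks an item for which the method returned no prediction:

```python
def gender_metrics(true, pred, cls="Male"):
    """Compute coverage plus per-class precision, recall, and F1.

    `true` and `pred` are parallel lists of "Male"/"Female" labels;
    `pred` may contain None where a method made no prediction -- those
    rows count against coverage and are excluded from the other metrics.
    """
    tp = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(true, pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(true, pred)
             if t == cls and p is not None and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    coverage = sum(p is not None for p in pred) / len(true)
    return {"coverage": coverage, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with five true labels and one missing prediction, coverage is 4/5 while the other metrics are computed over the four predicted rows only.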
You can read more about these performance metrics on Wikipedia.