Overall performance

This page presents benchmarks of different tools for gender inference. It evaluates various methods (inferring gender from images, names, or both) on open-source datasets (more information on the datasets here). It gives an overview of each method's overall performance and of how the methods perform when the data are split by gender or by image attributes (e.g., facial accessories, poses, or multiple faces in one image). We evaluated each method on the following performance metrics, defined at the bottom of this page: Coverage, Precision, Recall, and F1-score.


Image-based methods

| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.76 | 0.76 | 0.76 |
| M3 (images only) | WIKI | 1 | 0.89 | 0.89 | 0.88 |
| M3 (images only) | OUI | 1 | 0.69 | 0.69 | 0.69 |
| M3 (images only) | Twitter | 1 | 0.87 | 0.84 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.95 | 0.94 | 0.94 |
| M3 (images only) | Scholar | 1 | 0.95 | 0.95 | 0.95 |
| Deepface | IMDB | 0.57 | 0.78 | 0.78 | 0.78 |
| Deepface | WIKI | 0.48 | 0.92 | 0.91 | 0.90 |
| Deepface | OUI | 0.57 | 0.66 | 0.66 | 0.63 |
| Deepface | Twitter | 0.45 | 0.85 | 0.83 | 0.83 |
| Deepface | Gender Shades | 0.85 | 0.84 | 0.78 | 0.76 |
| Deepface | Scholar | 0.85 | 0.88 | 0.85 | 0.84 |

Performance split by gender (M = Male, F = Female):

| Model | Dataset | Coverage (M) | Precision (M) | Recall (M) | F1 (M) | Coverage (F) | Precision (F) | Recall (F) | F1 (F) |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.77 | 0.84 | 0.81 | 1 | 0.74 | 0.65 | 0.69 |
| M3 (images only) | WIKI | 1 | 0.90 | 0.98 | 0.94 | 1 | 0.87 | 0.55 | 0.68 |
| M3 (images only) | OUI | 1 | 0.67 | 0.63 | 0.65 | 1 | 0.71 | 0.74 | 0.72 |
| M3 (images only) | Twitter | 1 | 0.70 | 0.95 | 0.81 | 1 | 0.96 | 0.75 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.91 | 0.99 | 0.95 | 1 | 0.99 | 0.87 | 0.93 |
| M3 (images only) | Scholar | 1 | 0.93 | 0.97 | 0.95 | 1 | 0.97 | 0.92 | 0.95 |
| Deepface | IMDB | 0.53 | 0.78 | 0.84 | 0.81 | 0.62 | 0.79 | 0.71 | 0.75 |
| Deepface | WIKI | 0.45 | 0.91 | 0.98 | 0.94 | 0.63 | 0.94 | 0.72 | 0.81 |
| Deepface | OUI | 0.83 | 0.58 | 0.94 | 0.71 | 0.87 | 0.83 | 0.55 | 0.66 |
| Deepface | Twitter | 0.47 | 0.76 | 0.99 | 0.86 | 0.38 | 0.98 | 0.70 | 0.81 |
| Deepface | Gender Shades | 0.82 | 0.71 | 1 | 0.83 | 0.89 | 1 | 0.52 | 0.69 |
| Deepface | Scholar | 0.87 | 0.78 | 1 | 0.87 | 0.82 | 1 | 0.67 | 0.80 |
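For context, here is a minimal sketch of how a single image-based prediction can be obtained with Deepface. The image path is hypothetical, and the exact shape of the returned result (a dict or a list of dicts, and the key names) depends on the installed deepface version.

```python
from deepface import DeepFace

# "face.jpg" is a placeholder path, not part of the benchmark data.
result = DeepFace.analyze(
    img_path="face.jpg",
    actions=["gender"],
    # With the default enforce_detection=True, analyze() raises an error when
    # no face is detected; such detection failures are presumably what keeps
    # Deepface's coverage below 1 in the tables above.
    enforce_detection=False,
)
print(result)
```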




Name-based methods

| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.874 | 0.965 | 0.965 | 0.965 |
| "Gender" R (SSA) | WIKI | 0.823 | 0.957 | 0.954 | 0.955 |
| "Gender" R (SSA) | Twitter | 0.434 | 0.943 | 0.943 | 0.942 |
| "Gender" R (SSA) | Scholar | 0.797 | 0.968 | 0.968 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.851 | 0.909 | 0.907 | 0.905 |
| "Gender" R (IPUMS) | WIKI | 0.812 | 0.929 | 0.928 | 0.928 |
| "Gender" R (IPUMS) | Twitter | 0.46 | 0.846 | 0.843 | 0.842 |
| "Gender" R (IPUMS) | Scholar | 0.769 | 0.917 | 0.916 | 0.916 |
| Gender_Guesser | IMDB | 0.89 | 0.965 | 0.965 | 0.965 |
| Gender_Guesser | WIKI | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | Twitter | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | Scholar | 0.768 | 0.978 | 0.978 | 0.978 |

Performance split by gender (M = Male, F = Female):

| Model | Dataset | Coverage (M) | Precision (M) | Recall (M) | F1 (M) | Coverage (F) | Precision (F) | Recall (F) | F1 (F) |
|---|---|---|---|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.834 | 0.95 | 0.961 | 0.955 | 0.813 | 0.975 | 0.967 | 0.971 |
| "Gender" R (SSA) | WIKI | 0.784 | 0.847 | 0.955 | 0.898 | 0.814 | 0.988 | 0.953 | 0.97 |
| "Gender" R (SSA) | Twitter | 0.431 | 0.90 | 0.943 | 0.921 | 0.437 | 0.945 | 0.903 | 0.923 |
| "Gender" R (SSA) | Scholar | 0.78 | 0.97 | 0.965 | 0.967 | 0.814 | 0.965 | 0.97 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.904 | 0.935 | 0.812 | 0.869 | 0.656 | 0.893 | 0.965 | 0.928 |
| "Gender" R (IPUMS) | WIKI | 0.803 | 0.815 | 0.839 | 0.827 | 0.706 | 0.958 | 0.951 | 0.954 |
| "Gender" R (IPUMS) | Twitter | 0.508 | 0.875 | 0.794 | 0.833 | 0.412 | 0.816 | 0.89 | 0.852 |
| "Gender" R (IPUMS) | Scholar | 0.807 | 0.942 | 0.882 | 0.911 | 0.728 | 0.894 | 0.948 | 0.92 |
| Gender_Guesser | IMDB | 0.904 | 0.968 | 0.976 | 0.972 | 0.869 | 0.96 | 0.948 | 0.954 |
| Gender_Guesser | WIKI | 0.872 | 0.988 | 0.971 | 0.979 | 0.845 | 0.898 | 0.954 | 0.925 |
| Gender_Guesser | Twitter | 0.497 | 0.945 | 0.968 | 0.956 | 0.412 | 0.964 | 0.939 | 0.951 |
| Gender_Guesser | Scholar | 0.753 | 0.969 | 0.988 | 0.978 | 0.785 | 0.987 | 0.967 | 0.977 |
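For context, here is a minimal sketch of a name-based lookup with Gender_Guesser (the example names are ours, not benchmark data). How the ambiguous outcomes map to the Coverage metric, e.g., counting 'andy' and 'unknown' as no prediction, is our assumption about the benchmark setup.

```python
import gender_guesser.detector as gender

detector = gender.Detector(case_sensitive=False)

# get_gender() returns one of: 'male', 'female', 'mostly_male',
# 'mostly_female', 'andy' (androgynous), or 'unknown'.
for name in ["Maria", "John", "Kim"]:
    print(name, detector.get_gender(name))
```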



Image + name methods

| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.94 | 0.94 | 0.94 |
| M3 (images + names) | WIKI | 1 | 0.96 | 0.96 | 0.96 |
| M3 (images + names) | Twitter | 1 | 0.91 | 0.9 | 0.9 |
| M3 (images + names) | Scholar | 1 | 0.96 | 0.96 | 0.96 |

Performance split by gender (M = Male, F = Female):

| Model | Dataset | Coverage (M) | Precision (M) | Recall (M) | F1 (M) | Coverage (F) | Precision (F) | Recall (F) | F1 (F) |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.942 | 0.952 | 0.947 | 1 | 0.93 | 0.915 | 0.922 |
| M3 (images + names) | WIKI | 1 | 0.975 | 0.972 | 0.973 | 1 | 0.89 | 0.903 | 0.896 |
| M3 (images + names) | Twitter | 1 | 0.83 | 0.952 | 0.887 | 1 | 0.947 | 0.89 | 0.918 |
| M3 (images + names) | Scholar | 1 | 0.937 | 0.991 | 0.963 | 1 | 0.989 | 0.928 | 0.958 |
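For context, here is a minimal sketch following the m3inference package's documented usage. The input file name is hypothetical; each JSONL record is expected to carry the profile fields (name, screen name, biography, language) and a path to the profile image.

```python
from m3inference import M3Inference
import pprint

m3 = M3Inference()  # downloads the pretrained multimodal model on first use
pred = m3.infer("./users.jsonl")  # hypothetical input file
pprint.pprint(pred)  # per-user probability estimates for gender, age, and org
```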



Subgroup performance for images

In this section, we created five subgroups from the OUI dataset. The goal of these subgroups is to test the image-based methods on different image features and to see how performance changes from one subcategory to another. In addition, as the table below shows, performance is split by subcategory and gender.

Image-based methods

Columns are grouped by category: Ethnicity (African, Asian, European), Image quality (Blurry, High quality), Facial accessories (Covered head, Glasses), Multiple faces, and Poses (Side, Front).

| Model | Metric | African | Asian | European | Blurry | High quality | Covered head | Glasses | Multiple faces | Side | Front |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| M3 (images only) | F1 (avg) | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.86 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 (male) | 0.94 | 0.95 | 0.99 | 0.96 | 0.98 | 0.87 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 (female) | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.84 | 0.94 | 0.97 | 0.94 | 0.99 |
| Deepface | Coverage | 0.62 | 0.83 | 0.87 | 0.86 | 0.73 | 0.55 | 0.73 | 0.88 | 0.62 | 0.92 |
| Deepface | F1 (avg) | 0.56 | 0.7 | 0.84 | 0.77 | 0.82 | 0.59 | 0.54 | 0.87 | 0.88 | 0.84 |
| Deepface | F1 (male) | 0.74 | 0.78 | 0.86 | 0.82 | 0.84 | 0.72 | 0.73 | 0.89 | 0.91 | 0.87 |
| Deepface | F1 (female) | 0.34 | 0.63 | 0.82 | 0.72 | 0.8 | 0.47 | 0.33 | 0.85 | 0.86 | 0.82 |

Subgroup labeling

To benchmark the gender classification models on different subgroups, we took the OUI dataset as our base data because it is close to real-world conditions. We created five distinct groups: Multiple faces, Skin tone, Facial accessories, Image quality, and Poses.

The table below depicts the distribution of male and female images for each of these groups and their subcategories.

| Category | Subcategory | Female images | Male images | Category total |
|---|---|---|---|---|
| Image quality | Blurry | 50 | 50 | 200 |
| Image quality | High quality | 50 | 50 | 200 |
| Skin tone | Black | 50 | 50 | 300 |
| Skin tone | Caucasian | 50 | 50 | 300 |
| Skin tone | Asian | 50 | 50 | 300 |
| Multiple faces | — | 50 | 50 | 100 |
| Facial accessories | Glasses | 50 | 50 | 200 |
| Facial accessories | Covered head | 50 | 50 | 200 |
| Poses | Front | 50 | 50 | 200 |
| Poses | Side | 50 | 50 | 200 |

As the summary table shows, each category (except Multiple faces, which has no subcategories) is divided into subcategories, and each subcategory comprises 100 labeled images (50 labeled Male and 50 labeled Female). The images were chosen randomly from the original dataset according to the following conditions:

  • For the Multiple faces group, we chose images containing more than one face. Each image was labeled according to the person whose full face appears at the front or in the middle of the image.

  • For the Facial accessories group, we selected only single-face images with no more than one accessory. An image belongs to the Glasses subgroup if the person wears glasses but no headgear, and to the Covered head subgroup if the person wears headgear but no glasses.

  • For the Image quality category, we selected only single-face images. The Blurry subcategory contains images whose quality ranges from acceptable to very poor (unclear images); the High-quality subcategory contains images of perfect quality.

  • We also created a category based on the poses in the images. The Side subgroup contains images of a person captured from one side (side pose), and the Front subgroup includes images of a person captured from the front (front pose).

  • The Skin tone category contains images of people classified by their skin tone (Black, Asian, and Caucasian). We selected only single-face images.


Evaluation metrics

All the performance tables presented on our website are based on each class's confusion matrix (Male and Female). From these matrices, we computed the following evaluation metrics: F1-score, precision, recall, and coverage.

  • Confusion matrix for the "Male" class:

    |  | True gender = Male | True gender = Female |
    |---|---|---|
    | Pred gender = Male | True gender = Male & Pred gender = Male | True gender = Female & Pred gender = Male |
    | Pred gender = Female | True gender = Male & Pred gender = Female | True gender = Female & Pred gender = Female |

    $$Precision_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

    $$Recall_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$


  • Confusion matrix for the "Female" class:

    |  | True gender = Female | True gender = Male |
    |---|---|---|
    | Pred gender = Female | True gender = Female & Pred gender = Female | True gender = Male & Pred gender = Female |
    | Pred gender = Male | True gender = Female & Pred gender = Male | True gender = Male & Pred gender = Male |

    $$Precision_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$

    $$Recall_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

  • The F1-score is calculated as follows:

    $$F_1 = {2 \cdot Precision \cdot Recall \over Precision + Recall}$$

  • The Coverage metric represents the proportion of a dataset for which a given method returns a prediction. A minimal sketch showing how all four metrics can be computed follows below.
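As an illustration, here is a minimal sketch of how these per-class metrics can be computed from lists of true and predicted labels. The function name, the example labels, and the convention that a missing prediction is represented by `None` are our assumptions, not part of the benchmark code.

```python
def evaluate(true_labels, pred_labels, cls):
    """Per-class precision, recall, F1-score, and coverage.

    pred_labels may contain None where a method returned no prediction;
    those items lower coverage and are excluded from the other metrics.
    """
    covered = [(t, p) for t, p in zip(true_labels, pred_labels) if p is not None]
    coverage = len(covered) / len(true_labels)

    # Confusion-matrix counts for the class of interest
    tp = sum(1 for t, p in covered if t == cls and p == cls)
    fp = sum(1 for t, p in covered if t != cls and p == cls)
    fn = sum(1 for t, p in covered if t == cls and p != cls)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return coverage, precision, recall, f1


# Toy example with made-up labels (not benchmark data)
y_true = ["Male", "Female", "Male", "Female", "Male"]
y_pred = ["Male", "Male", "Male", None, "Female"]
print(evaluate(y_true, y_pred, "Male"))  # approximately (0.8, 0.667, 0.667, 0.667)
```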

You can read more about these performance metrics on Wikipedia.