Performance benchmarking

This page presents benchmarking results for different gender-inference tools. Various methods (inferring gender from images, names, or both) are evaluated on open-source datasets (more information on the datasets here). For each method, coverage is reported together with performance metrics (precision, recall, and F1-score), and the results can also be split by gender or by image features.

Coverage and F1-score on subgroups:

Subgroup columns are grouped by category: Ethnicity (African, Asian, European), Image quality (Blur, Perfect), Facial accessories (Covered head, Glasses, Multiple faces), and Poses (Side, Front).

| Method | Metric | African | Asian | European | Blur | Perfect | Covered head | Glasses | Multiple faces | Side | Front |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deepface | Coverage | 0.62 | 0.83 | 0.87 | 0.86 | 0.73 | 0.55 | 0.73 | 0.88 | 0.62 | 0.92 |
| | F1 Avg | 0.56 | 0.70 | 0.84 | 0.77 | 0.82 | 0.59 | 0.54 | 0.87 | 0.88 | 0.84 |
| | F1 M | 0.74 | 0.78 | 0.86 | 0.82 | 0.84 | 0.72 | 0.73 | 0.89 | 0.91 | 0.87 |
| | F1 F | 0.34 | 0.63 | 0.82 | 0.72 | 0.80 | 0.47 | 0.33 | 0.85 | 0.86 | 0.82 |
| M3 (only images) | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | F1 Avg | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.86 | 0.94 | 0.97 | 0.94 | 0.99 |
| | F1 M | 0.94 | 0.95 | 0.99 | 0.96 | 0.98 | 0.87 | 0.94 | 0.97 | 0.94 | 0.99 |
| | F1 F | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.84 | 0.94 | 0.97 | 0.94 | 0.99 |

Subgroup creation

Subgroups were created to test the performance of M3 and CNN Baseline on specific image features. They contain manually selected images from the OUI dataset.

Each subgroup dataset has 100 labeled images: 50 labeled 'Male' and 50 labeled 'Female'. In total, there are 10 datasets divided into 5 categories, and each category has one to three subcategories:

| Category | Subcategory | Female images | Male images | Total images |
|---|---|---|---|---|
| Image quality | Blurry | 50 | 50 | 200 |
| | High quality | 50 | 50 | |
| Ethnicity | African | 50 | 50 | 300 |
| | European | 50 | 50 | |
| | Asian | 50 | 50 | |
| Multiple faces | — | 50 | 50 | 100 |
| Facial accessories | Glasses | 50 | 50 | 200 |
| | Covered head | 50 | 50 | |
| Poses | Front | 50 | 50 | 200 |
| | Side | 50 | 50 | |
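
To make the benchmark numbers above concrete, the sketch below shows one way coverage and per-gender F1 could be computed for a single subgroup. It assumes predictions are collected as `(true_label, predicted_label)` pairs, with `None` standing for images where a tool returned no prediction, and uses scikit-learn's `f1_score`; the function name and data layout are illustrative and not part of the benchmark code.

```python
from sklearn.metrics import f1_score

def evaluate_subgroup(samples):
    """Coverage and per-gender F1 for one subgroup.

    `samples` is a list of (true_label, predicted_label) pairs where
    labels are the strings 'Male' / 'Female' and predicted_label is
    None when the tool produced no prediction for that image.
    """
    # Coverage: share of images for which the tool returned any prediction.
    covered = [(t, p) for t, p in samples if p is not None]
    coverage = len(covered) / len(samples)

    y_true = [t for t, _ in covered]
    y_pred = [p for _, p in covered]

    # Per-class F1, mirroring the "F1 M" and "F1 F" rows above.
    f1_male = f1_score(y_true, y_pred, pos_label='Male')
    f1_female = f1_score(y_true, y_pred, pos_label='Female')

    # Simple unweighted mean of the two; the table's "F1 Avg" may use a
    # slightly different (e.g. support-weighted) averaging.
    f1_avg = (f1_male + f1_female) / 2

    return coverage, f1_avg, f1_male, f1_female
```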

Evaluation explanation

All metrics in the tables above are computed from a confusion matrix built for each class (Male and Female): precision and recall are derived from the true/false positive and negative counts of that class.

The F1-score is then calculated as follows: $$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
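
As a minimal illustration of these formulas, the snippet below computes precision, recall, and F1 for a single class from its confusion-matrix counts; the function and variable names are hypothetical and only meant to mirror the equation above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class (e.g. 'Male'),
    given its true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 45 'Male' images classified correctly,
# 5 'Female' images misclassified as 'Male', 5 'Male' images missed.
print(precision_recall_f1(tp=45, fp=5, fn=5))  # (0.9, 0.9, 0.9)
```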

You can read more about performance metrics on Wikipedia.