Performance benchmarking
This page benchmarks different tools for gender inference. It evaluates various methods (inferring gender from images, names, or both) on open-source datasets (more information on the datasets here). For each method, coverage is reported alongside performance metrics (precision, recall, and F1-score), and results can also be split by gender or by image features.
F1-score on subgroups:
| Method | Metric | African | Asian | European | Blur | Perfect | Covered head | Glasses | Multiple faces | Side | Front |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deepface | Coverage | 0.62 | 0.83 | 0.87 | 0.86 | 0.73 | 0.55 | 0.73 | 0.88 | 0.62 | 0.92 |
| | F1 Avg | 0.56 | 0.70 | 0.84 | 0.77 | 0.82 | 0.59 | 0.54 | 0.87 | 0.88 | 0.84 |
| | F1 M | 0.74 | 0.78 | 0.86 | 0.82 | 0.84 | 0.72 | 0.73 | 0.89 | 0.91 | 0.87 |
| | F1 F | 0.34 | 0.63 | 0.82 | 0.72 | 0.80 | 0.47 | 0.33 | 0.85 | 0.86 | 0.82 |
| M3 (only images) | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | F1 Avg | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.86 | 0.94 | 0.97 | 0.94 | 0.99 |
| | F1 M | 0.94 | 0.95 | 0.99 | 0.96 | 0.98 | 0.87 | 0.94 | 0.97 | 0.94 | 0.99 |
| | F1 F | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.84 | 0.94 | 0.97 | 0.94 | 0.99 |

*Columns are grouped by image feature: Ethnicity (African, Asian, European), Image quality (Blur, Perfect), Facial accessories (Covered head, Glasses, Multiple faces), and Poses (Side, Front).*
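The metrics in the table above can be sketched in plain Python. This is an illustrative example, not the benchmark's actual evaluation code: coverage is the fraction of images for which the method returns any prediction, and the per-gender F1-scores are computed over the covered images only (the labels and predictions below are made up).

```python
def coverage(predictions):
    """Fraction of images for which the method returned any prediction."""
    return sum(p is not None for p in predictions) / len(predictions)

def f1(labels, predictions, positive):
    """F1-score for one class, computed only over covered images."""
    pairs = [(t, p) for t, p in zip(labels, predictions) if p is not None]
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative data; None marks an image where no face was detected.
labels = ["M", "M", "F", "F", "M", "F"]
preds  = ["M", None, "F", "M", "M", "F"]

cov = coverage(preds)                       # 5 of 6 images covered
f1_avg = (f1(labels, preds, "M") + f1(labels, preds, "F")) / 2  # "F1 Avg"
```

"F1 Avg" here is the macro average of the male and female F1-scores, which is consistent with the per-gender rows in the table.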
Subgroup creation
Subgroups were created to test the performance of M3 and CNN Baseline on specific image features. They contain images from the OUI dataset that were chosen manually.
Each dataset has 100 labeled images: 50 labeled 'Male' and 50 labeled 'Female'. In total, there are 10 datasets divided into 5 categories, each with one to three subcategories:
| Category | Subcategory | Female images | Male images | Total images |
|---|---|---|---|---|
| Image quality | Blurry | 50 | 50 | 200 |
| | High quality | 50 | 50 | |
| Ethnicity | African | 50 | 50 | 300 |
| | European | 50 | 50 | |
| | Asian | 50 | 50 | |
| Multiple faces | | 50 | 50 | 100 |
| Facial accessories | Glasses | 50 | 50 | 200 |
| | Covered head | 50 | 50 | |
| Poses | Front | 50 | 50 | 200 |
| | Side | 50 | 50 | |
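Building a balanced subgroup like the ones above can be sketched as follows. This is a hypothetical helper, not the pipeline actually used for the benchmark; `balanced_subgroup` and its `(path, gender)` input format are assumptions for illustration.

```python
import random

def balanced_subgroup(images, per_gender=50, seed=0):
    """Sample a balanced subgroup with `per_gender` images per label.

    `images` is a list of (path, gender) pairs, gender in {'Male', 'Female'}.
    A fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    by_gender = {"Male": [], "Female": []}
    for path, gender in images:
        by_gender[gender].append(path)
    subgroup = []
    for gender, paths in by_gender.items():
        subgroup += [(p, gender) for p in rng.sample(paths, per_gender)]
    return subgroup

# Illustrative pool: 80 male and 80 female candidate images for one feature.
pool = [(f"img_{i:03d}.jpg", "Male" if i < 80 else "Female") for i in range(160)]
subgroup = balanced_subgroup(pool)  # 100 images, 50 per gender
```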
Evaluation explanation
In the tables above, all evaluation metrics are computed from a per-class confusion matrix (one for "Male" and one for "Female").

If we are looking at the "Male" class:

| | True gender = Male | True gender = Female |
|---|---|---|
| Pred gender = Male | True gender = Male & Pred gender = Male | True gender = Female & Pred gender = Male |
| Pred gender = Female | True gender = Male & Pred gender = Female | True gender = Female & Pred gender = Female |
$$Precision_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$
$$Recall_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$
And the following if we are looking at the "Female" class:

| | True gender = Female | True gender = Male |
|---|---|---|
| Pred gender = Female | True gender = Female & Pred gender = Female | True gender = Male & Pred gender = Female |
| Pred gender = Male | True gender = Female & Pred gender = Male | True gender = Male & Pred gender = Male |
$$Precision_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$
$$Recall_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$
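The four formulas above map directly onto the confusion-matrix cells. A minimal sketch, with illustrative labels and predictions (not benchmark data):

```python
from collections import Counter

# Illustrative ground truth and predictions.
true_g = ["Male", "Male", "Female", "Female", "Male"]
pred_g = ["Male", "Female", "Female", "Male", "Male"]

# Confusion-matrix cells: (true gender, predicted gender) -> count.
cm = Counter(zip(true_g, pred_g))

# Precision and recall per class, term by term as in the formulas above.
prec_m = cm[("Male", "Male")] / (cm[("Male", "Male")] + cm[("Female", "Male")])
rec_m = cm[("Male", "Male")] / (cm[("Male", "Male")] + cm[("Male", "Female")])
prec_f = cm[("Female", "Female")] / (cm[("Female", "Female")] + cm[("Male", "Female")])
rec_f = cm[("Female", "Female")] / (cm[("Female", "Female")] + cm[("Female", "Male")])
```

The F1-scores reported in the benchmark tables are then the harmonic mean of each class's precision and recall.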
You can read more about performance metrics on Wikipedia.