Overall performance
This page benchmarks different tools for gender inference. It evaluates various methods (inferring gender from images, names, or both) on open-source datasets (more information on the datasets here). You can get an overview of each method's overall performance and of how it performs when the data is split by gender or by image attributes (e.g., facial accessories, poses, or multiple faces in one image). We evaluated each method on four performance metrics, defined at the bottom of this page: Coverage, Precision, Recall, and F1-score.
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.76 | 0.76 | 0.76 |
| M3 (images only) | WIKI | 1 | 0.89 | 0.89 | 0.88 |
| M3 (images only) | OUI | 1 | 0.69 | 0.69 | 0.69 |
| M3 (images only) | | 1 | 0.87 | 0.84 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.95 | 0.94 | 0.94 |
| M3 (images only) | Scholar | 1 | 0.95 | 0.95 | 0.95 |
| Deepface | IMDB | 0.57 | 0.78 | 0.78 | 0.78 |
| Deepface | WIKI | 0.48 | 0.92 | 0.91 | 0.90 |
| Deepface | OUI | 0.57 | 0.66 | 0.66 | 0.63 |
| Deepface | | 0.45 | 0.85 | 0.83 | 0.83 |
| Deepface | Gender Shades | 0.85 | 0.84 | 0.78 | 0.76 |
| Deepface | Scholar | 0.85 | 0.88 | 0.85 | 0.84 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | IMDB | 1 | 0.77 | 0.84 | 0.81 | 1 | 0.74 | 0.65 | 0.69 |
| M3 (images only) | WIKI | 1 | 0.90 | 0.98 | 0.94 | 1 | 0.87 | 0.55 | 0.68 |
| M3 (images only) | OUI | 1 | 0.67 | 0.63 | 0.65 | 1 | 0.71 | 0.74 | 0.72 |
| M3 (images only) | | 1 | 0.70 | 0.95 | 0.81 | 1 | 0.96 | 0.75 | 0.84 |
| M3 (images only) | Gender Shades | 1 | 0.91 | 0.99 | 0.95 | 1 | 0.99 | 0.87 | 0.93 |
| M3 (images only) | Scholar | 1 | 0.93 | 0.97 | 0.95 | 1 | 0.97 | 0.92 | 0.95 |
| Deepface | IMDB | 0.53 | 0.78 | 0.84 | 0.81 | 0.62 | 0.79 | 0.71 | 0.75 |
| Deepface | WIKI | 0.45 | 0.91 | 0.98 | 0.94 | 0.63 | 0.94 | 0.72 | 0.81 |
| Deepface | OUI | 0.83 | 0.58 | 0.94 | 0.71 | 0.87 | 0.83 | 0.55 | 0.66 |
| Deepface | | 0.47 | 0.76 | 0.99 | 0.86 | 0.38 | 0.98 | 0.70 | 0.81 |
| Deepface | Gender Shades | 0.82 | 0.71 | 1 | 0.83 | 0.89 | 1 | 0.52 | 0.69 |
| Deepface | Scholar | 0.87 | 0.78 | 1 | 0.87 | 0.82 | 1 | 0.67 | 0.80 |
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.874 | 0.965 | 0.965 | 0.965 |
| "Gender" R (SSA) | WIKI | 0.823 | 0.957 | 0.954 | 0.955 |
| "Gender" R (SSA) | | 0.434 | 0.943 | 0.943 | 0.942 |
| "Gender" R (SSA) | Scholar | 0.797 | 0.968 | 0.968 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.851 | 0.909 | 0.907 | 0.905 |
| "Gender" R (IPUMS) | WIKI | 0.812 | 0.929 | 0.928 | 0.928 |
| "Gender" R (IPUMS) | | 0.46 | 0.846 | 0.843 | 0.842 |
| "Gender" R (IPUMS) | Scholar | 0.769 | 0.917 | 0.916 | 0.916 |
| Gender_Guesser | IMDB | 0.89 | 0.965 | 0.965 | 0.965 |
| Gender_Guesser | WIKI | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | | 0.865 | 0.969 | 0.968 | 0.968 |
| Gender_Guesser | Scholar | 0.768 | 0.978 | 0.978 | 0.978 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| "Gender" R (SSA) | IMDB | 0.834 | 0.95 | 0.961 | 0.955 | 0.813 | 0.975 | 0.967 | 0.971 |
| "Gender" R (SSA) | WIKI | 0.784 | 0.847 | 0.955 | 0.898 | 0.814 | 0.988 | 0.953 | 0.97 |
| "Gender" R (SSA) | | 0.431 | 0.90 | 0.943 | 0.921 | 0.437 | 0.945 | 0.903 | 0.923 |
| "Gender" R (SSA) | Scholar | 0.78 | 0.97 | 0.965 | 0.967 | 0.814 | 0.965 | 0.97 | 0.968 |
| "Gender" R (IPUMS) | IMDB | 0.904 | 0.935 | 0.812 | 0.869 | 0.656 | 0.893 | 0.965 | 0.928 |
| "Gender" R (IPUMS) | WIKI | 0.803 | 0.815 | 0.839 | 0.827 | 0.706 | 0.958 | 0.951 | 0.954 |
| "Gender" R (IPUMS) | | 0.508 | 0.875 | 0.794 | 0.833 | 0.412 | 0.816 | 0.89 | 0.852 |
| "Gender" R (IPUMS) | Scholar | 0.807 | 0.942 | 0.882 | 0.911 | 0.728 | 0.894 | 0.948 | 0.92 |
| Gender_Guesser | IMDB | 0.904 | 0.968 | 0.976 | 0.972 | 0.869 | 0.96 | 0.948 | 0.954 |
| Gender_Guesser | WIKI | 0.872 | 0.988 | 0.971 | 0.979 | 0.845 | 0.898 | 0.954 | 0.925 |
| Gender_Guesser | | 0.497 | 0.945 | 0.968 | 0.956 | 0.412 | 0.964 | 0.939 | 0.951 |
| Gender_Guesser | Scholar | 0.753 | 0.969 | 0.988 | 0.978 | 0.785 | 0.987 | 0.967 | 0.977 |
| Model | Dataset | Coverage | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.94 | 0.94 | 0.94 |
| M3 (images + names) | WIKI | 1 | 0.96 | 0.96 | 0.96 |
| M3 (images + names) | | 1 | 0.91 | 0.9 | 0.9 |
| M3 (images + names) | Scholar | 1 | 0.96 | 0.96 | 0.96 |
| Model | Dataset | Male Coverage | Male Precision | Male Recall | Male F1 | Female Coverage | Female Precision | Female Recall | Female F1 |
|---|---|---|---|---|---|---|---|---|---|
| M3 (images + names) | IMDB | 1 | 0.942 | 0.952 | 0.947 | 1 | 0.93 | 0.915 | 0.922 |
| M3 (images + names) | WIKI | 1 | 0.975 | 0.972 | 0.973 | 1 | 0.89 | 0.903 | 0.896 |
| M3 (images + names) | | 1 | 0.83 | 0.952 | 0.887 | 1 | 0.947 | 0.89 | 0.918 |
| M3 (images + names) | Scholar | 1 | 0.937 | 0.991 | 0.963 | 1 | 0.989 | 0.928 | 0.958 |
Subgroup performance for images
In this section, we created five subgroups from the OUI dataset. The goal of these subgroups is to test the image-based methods on different image features and to see how performance changes from one subcategory to another. As the table below shows, performance is also split by subcategory and by gender.
Image-based methods
The columns group by Ethnicity (African, Asian, European), Image quality (Blurry, High quality), Facial accessories (Covered head, Glasses), Multiple faces, and Poses (Side, Front).

| Model | Metric | African | Asian | European | Blurry | High quality | Covered head | Glasses | Multiple faces | Side | Front |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M3 (images only) | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| M3 (images only) | F1 Avg | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.86 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 Male | 0.94 | 0.95 | 0.99 | 0.96 | 0.98 | 0.87 | 0.94 | 0.97 | 0.94 | 0.99 |
| M3 (images only) | F1 Female | 0.93 | 0.95 | 0.99 | 0.96 | 0.98 | 0.84 | 0.94 | 0.97 | 0.94 | 0.99 |
| Deepface | Coverage | 0.62 | 0.83 | 0.87 | 0.86 | 0.73 | 0.55 | 0.73 | 0.88 | 0.62 | 0.92 |
| Deepface | F1 Avg | 0.56 | 0.7 | 0.84 | 0.77 | 0.82 | 0.59 | 0.54 | 0.87 | 0.88 | 0.84 |
| Deepface | F1 Male | 0.74 | 0.78 | 0.86 | 0.82 | 0.84 | 0.72 | 0.73 | 0.89 | 0.91 | 0.87 |
| Deepface | F1 Female | 0.34 | 0.63 | 0.82 | 0.72 | 0.8 | 0.47 | 0.33 | 0.85 | 0.86 | 0.82 |
Subgroups labeling
To benchmark the gender classification models on different subgroups, we took the OUI dataset as our base data because it is close to real-world conditions. We created five distinct groups: Multiple faces, Skin tone, Facial accessories, Image quality, and Poses.
The table below depicts the distribution of male and female images for each of these groups and their subcategories.
| Category | Subcategory | Female images | Male images | Total images |
|---|---|---|---|---|
| Image quality | Blurry | 50 | 50 | 200 |
| Image quality | High quality | 50 | 50 | |
| Skin tone | Black | 50 | 50 | 300 |
| Skin tone | Caucasian | 50 | 50 | |
| Skin tone | Asian | 50 | 50 | |
| Multiple faces | | 50 | 50 | 100 |
| Facial accessories | Glasses | 50 | 50 | 200 |
| Facial accessories | Covered head | 50 | 50 | |
| Poses | Front | 50 | 50 | 200 |
| Poses | Side | 50 | 50 | |
As the summary table shows, each category consists of several subcategories, and each subcategory comprises 100 labeled images (50 labeled Male and 50 labeled Female). The images were chosen randomly from the original dataset according to the following conditions:
- For the Multiple faces group, we chose images with more than one face. Each image was labeled according to the person whose full face appears at the front or in the middle of the image.
- For the Facial accessories group, we selected only single-face images with no more than one accessory. An image falls in the Glasses subgroup if the person wears glasses but no headgear, and in the Covered head subgroup if the person wears headgear but no glasses.
- For the Image quality category, we selected only single-face images. The Blurry subcategory contains images whose quality ranges from acceptable to very poor (unclear images); the High-quality subcategory contains images of perfect quality.
- For the Poses category, the Side subgroup contains images of a person captured from the side (side pose), and the Front subgroup contains images of a person captured from the front (front pose).
- The Skin tone category contains images of people classified by their skin tone (Black, Asian, and Caucasian); here, too, we selected only single-face images.
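The balanced 50/50 sampling used for each subcategory can be sketched as follows. This is an illustrative sketch, not the actual labeling pipeline; the `images` records and their `gender` field are hypothetical.

```python
import random

def sample_balanced_subgroup(images, n_per_gender=50, seed=0):
    """Randomly draw n_per_gender images per gender label, mirroring the
    50 Male / 50 Female split used for each subcategory.

    `images` is assumed to be a list of records with a "gender" field.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    males = [im for im in images if im["gender"] == "Male"]
    females = [im for im in images if im["gender"] == "Female"]
    return rng.sample(males, n_per_gender) + rng.sample(females, n_per_gender)
```

In practice, each subcategory's candidate pool would first be filtered by the conditions above (e.g., single-face images only) before sampling.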
Evaluation metrics
In all the performance tables presented on our website, we used the confusion matrix of each class (Male and Female). From it, we computed the following evaluation metrics: Coverage, Precision, Recall, and F1-score.

Confusion matrix for the "Male" class:

| | True gender = Male | True gender = Female |
|---|---|---|
| Predicted gender = Male | True gender = Male & Pred gender = Male | True gender = Female & Pred gender = Male |
| Predicted gender = Female | True gender = Male & Pred gender = Female | True gender = Female & Pred gender = Female |

$$Precision_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

$$Recall_M = {(True\ gender = Male\ \&\ Pred\ gender = Male) \over ((True\ gender = Male\ \&\ Pred\ gender = Male) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$

Confusion matrix for the "Female" class:

| | True gender = Female | True gender = Male |
|---|---|---|
| Predicted gender = Female | True gender = Female & Pred gender = Female | True gender = Male & Pred gender = Female |
| Predicted gender = Male | True gender = Female & Pred gender = Male | True gender = Male & Pred gender = Male |

$$Precision_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Male\ \&\ Pred\ gender = Female))}$$

$$Recall_F = {(True\ gender = Female\ \&\ Pred\ gender = Female) \over ((True\ gender = Female\ \&\ Pred\ gender = Female) + (True\ gender = Female\ \&\ Pred\ gender = Male))}$$

The F1-score for each class combines its precision and recall:

$$F1 = {2 \times Precision \times Recall \over Precision + Recall}$$

The Coverage metric represents the proportion of a dataset for which the given method returns a prediction.
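As an illustrative sketch (not the code used to produce the tables above), the per-class metrics and coverage can be computed from paired label lists, where `None` marks an item for which the method returned no prediction:

```python
def gender_metrics(true, pred, cls="Male"):
    """Compute coverage plus per-class precision, recall, and F1.

    `true` and `pred` are parallel lists of "Male"/"Female" labels;
    `pred` may contain None where a method made no prediction -- those
    rows count against coverage and are excluded from the other metrics.
    """
    tp = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(true, pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(true, pred)
             if t == cls and p is not None and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    coverage = sum(p is not None for p in pred) / len(true)
    return {"coverage": coverage, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with five true labels and one missing prediction, coverage is 4/5 while the other metrics are computed over the four predicted rows only.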
You can read more about these performance metrics on Wikipedia.