Understanding Fairness of Gender Classification Algorithms Across Gender-Race Groups

Anoop Krishnan, Ali Almadan, Ajita Rattani
DOI: https://doi.org/10.48550/arXiv.2009.11491
2020-09-24
Abstract: Automated gender classification has important applications in many domains, such as demographic research, law enforcement, online advertising, as well as human-computer interaction. Recent research has questioned the fairness of this technology across gender and race. Specifically, the majority of the studies raised the concern of higher error rates of the face-based gender classification system for darker-skinned people like African-American and for women. However, to date, the majority of existing studies were limited to African-American and Caucasian only. The aim of this paper is to investigate the differential performance of the gender classification algorithms across gender-race groups. To this aim, we investigate the impact of (a) architectural differences in the deep learning algorithms and (b) training set imbalance, as a potential source of bias causing differential performance across gender and race. Experimental investigations are conducted on two latest large-scale publicly available facial attribute datasets, namely, UTKFace and FairFace. The experimental results suggested that the algorithms with architectural differences varied in performance with consistency towards specific gender-race groups. For instance, for all the algorithms used, Black females (Black race in general) always obtained the least accuracy rates. Middle Eastern males and Latino females obtained higher accuracy rates most of the time. Training set imbalance further widens the gap in the unequal accuracy rates across all gender-race groups. Further investigations using facial landmarks suggested that facial morphological differences due to the bone structure influenced by genetic and environmental factors could be the cause of the least performance of Black females and Black race, in general.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the fairness of automated gender classification systems across gender-race groups. Specifically, it focuses on:

1. **The impact of algorithm architecture differences**: Examine how different deep learning architectures affect gender classification performance, and in particular how they produce performance gaps between gender-race groups.
2. **The impact of training set imbalance**: Explore how an uneven distribution of gender and race in the training set affects the fairness of gender classification algorithms.
3. **The impact of facial morphological differences**: Analyze whether differences in facial morphology (such as bone structure) explain the lower classification accuracy for some groups, especially Black women.

### Research Background

Automated gender classification has important applications in many fields, such as demographic research, law enforcement, online advertising, and human-computer interaction. However, recent research has pointed out that the technology performs unequally across gender and race groups, with higher error rates for darker-skinned people (such as African-Americans) and for women. Existing studies have mostly been limited to African-Americans and Caucasians and lack a broader assessment of other race groups.

### Research Objectives

The paper studies the performance differences of gender classification algorithms across gender-race groups along three lines:

- **Algorithm architecture differences**: Analyze the impact of different deep learning architectures on performance.
- **Training set imbalance**: Evaluate how an uneven distribution of gender and race in the training set affects algorithm performance.
- **Facial morphological features**: Investigate whether differences in facial morphology (such as bone structure) lead to lower classification accuracy for some groups.

### Experimental Design

- **Datasets**: Two large-scale, publicly available facial attribute datasets, UTKFace and FairFace.
- **Models**: Several deep learning models (such as VGG, ResNet, and InceptionNet), fine-tuned for gender classification.
- **Evaluation metrics**: In addition to overall accuracy, the false positive rate and false negative rate are analyzed to give a fuller picture of each algorithm's fairness.

### Main Findings

- **Algorithm architecture differences**: Performance varies across architectures, but the variation consistently favors or disfavors specific gender-race groups. For example, all algorithms have the lowest classification accuracy for Black women, while accuracy for Middle Eastern men and Latina women is relatively high.
- **Training set imbalance**: Imbalance in the training set further widens the performance gaps between gender-race groups. In particular, when the training set is skewed towards men, all models generally classify men more accurately than women.
- **Facial morphological features**: Differences in facial morphology (influenced by genetic and environmental factors) may be one reason for the lower classification accuracy for Black women.

### Conclusion

Through systematic experiments and analysis, the paper documents the performance differences of gender classification algorithms across gender-race groups and examines their likely causes. The results underline the importance of considering fairness when designing and developing gender classification systems, and they offer insights for reducing algorithmic bias.
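
### Illustrative Sketch: Per-Group Metrics

The evaluation metrics listed under Experimental Design (accuracy, false positive rate, and false negative rate, broken down by group) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `group_metrics` and `accuracy_gap`, the record layout `(race, true_gender, predicted_gender)`, and the choice of "female" as the positive class are all assumptions made here for the example.

```python
from collections import defaultdict


def group_metrics(records, positive="female"):
    """Compute accuracy, FPR and FNR per race group.

    `records` is an iterable of (race, true_gender, predicted_gender).
    Treating `positive` as the positive class is an assumption of this
    sketch; the summary does not say which convention the paper uses.
    """
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for race, truth, pred in records:
        c = counts[race]
        if truth == positive:
            c["tp" if pred == positive else "fn"] += 1
        else:
            c["fp" if pred == positive else "tn"] += 1

    def ratio(num, den):
        # Return None instead of dividing by zero for empty groups.
        return num / den if den else None

    metrics = {}
    for race, c in counts.items():
        total = sum(c.values())
        metrics[race] = {
            "accuracy": ratio(c["tp"] + c["tn"], total),
            # FPR: share of true negatives-class samples predicted positive.
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
            # FNR: share of positive-class samples predicted negative.
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),
        }
    return metrics


def accuracy_gap(metrics):
    """Difference between the best and worst per-group accuracy."""
    accs = [m["accuracy"] for m in metrics.values() if m["accuracy"] is not None]
    return max(accs) - min(accs)
```

With "female" as the positive class, the accuracy for the female subgroup of a race equals `1 - fnr` and the accuracy for the male subgroup equals `1 - fpr`, so the same counts also give the gender-race breakdown. A single summary number such as `accuracy_gap` is only one possible way to express unequal accuracy across groups; it is used here for illustration.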