Abstract: Technologies for recognizing facial attributes like race, gender, age, and emotion have several applications, such as surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics based on images and analyzing facial expressions have several challenges due to the complexity of humans' facial attributes. Traditional approaches have employed CNNs and various other deep learning techniques, trained on extensive collections of labeled images. While these methods demonstrated effective performance, there remains potential for further enhancements. In this paper, we propose to utilize vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images with human faces. Various datasets like FairFace, AffectNet, and UTKFace have been utilized to evaluate the solutions. The results show that VLMs are competitive if not superior to traditional techniques. Additionally, we propose "FaceScanPaliGemma"--a fine-tuned PaliGemma model--for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose "FaceScanGPT", which is a GPT-4o model to recognize the above attributes when several individuals are present in the image using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT to detect the individual's attributes like haircut, clothing color, postures, etc., using only a prompt to drive the detection and recognition tasks.
What problem does this paper attempt to address?
The paper aims to improve the accuracy and efficiency of facial attribute recognition (race, gender, age, and emotion). Specifically, the authors use Vision Language Models (VLMs) to recognize multiple facial attributes simultaneously and to overcome the limitations of traditional methods in handling complex and diverse facial data.
### Main problems:
1. **Multi-task classification challenges**: Traditional facial attribute recognition usually requires training a separate model for each attribute (race, gender, age, and emotion), which increases the demand for computing resources and may lead to redundant learning.
2. **Dataset bias**: Existing facial attribute datasets may suffer from sample imbalance, pose variation, and differences in lighting conditions, which can degrade model performance in practical applications.
3. **Model performance improvement**: Although Convolutional Neural Networks (CNNs) and other deep learning techniques have achieved good results in facial attribute recognition, there is still room for improvement, especially in multi-task classification and zero-shot learning.
4. **Ethical issues**: Facial attribute recognition involves sensitive human attributes (such as race and gender), so it is necessary to ensure the fairness, transparency, and interpretability of the model and to avoid bias and discrimination.
### Solutions:
To address the above challenges, the authors propose the following solutions:
1. **Using Vision Language Models (VLMs)**: The authors evaluate several advanced Vision Language Models, namely GPT, GEMINI, LLAVA, PaliGemma, and Florence-2. These models have strong multi-modal capabilities and can process image and text information together.
2. **Multi-task learning**: A single model recognizes multiple facial attributes at once, which reduces the need for several independent models and improves computational efficiency.
3. **Zero-shot classification**: The authors evaluate VLMs in zero-shot classification, that is, classification driven by a prompt without task-specific training on labeled examples, to demonstrate the generalization ability of VLMs.
4. **Evaluation on multiple datasets**: Experiments use several public datasets, including FairFace, AffectNet, and UTKFace, to check that the models perform well on diverse and complex facial data.
5. **Model fine-tuning**: By fine-tuning the PaliGemma model, the authors developed a new model named "FaceScanPaliGemma", which achieved higher accuracy in race, gender, age, and emotion classification tasks.
6. **Multi-task processing ability**: The authors propose "FaceScanGPT", a model based on GPT-4o that recognizes facial and physical attributes in images containing multiple individuals, using a prompt that describes the person of interest.
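The prompt-driven, multi-attribute setup described in the list above can be illustrated with a minimal sketch. It only builds a text prompt and validates a model reply; it does not call any VLM. The label sets, the prompt wording, the JSON reply format, and the function names `build_prompt` and `parse_response` are all assumptions made for illustration and are not taken from the paper.

```python
import json

# Illustrative label sets (assumptions); the paper's exact categories
# are not listed in this summary.
LABELS = {
    "race": ["White", "Black", "East Asian", "Indian", "Middle Eastern"],
    "gender": ["Male", "Female"],
    "age": ["0-19", "20-29", "30-49", "50+"],
    "emotion": ["neutral", "happy", "sad", "surprise", "anger"],
}


def build_prompt(labels=LABELS):
    """Build one text prompt that asks for all attributes at once."""
    lines = ["Classify the face in the image. Answer with one JSON object only."]
    for attribute, options in labels.items():
        lines.append(f'- "{attribute}": one of {options}')
    return "\n".join(lines)


def parse_response(text, labels=LABELS):
    """Extract the JSON object from a model reply and validate each label.

    Labels outside the allowed set (or missing attributes) become None,
    so they can be counted as errors downstream.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(text[start:end + 1])
    result = {}
    for attribute, options in labels.items():
        value = data.get(attribute)
        result[attribute] = value if value in options else None
    return result
```

In such a setup, the string returned by `build_prompt()` would be sent together with the image to whichever VLM is under test, and the raw reply would be passed through `parse_response` before scoring. Mapping out-of-vocabulary answers to `None` is one possible design choice; it keeps free-form replies from being counted as correct.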
### Experimental results:
The experimental results show that VLMs are competitive with, if not superior to, traditional methods for facial attribute recognition. Specifically, "FaceScanPaliGemma" reaches accuracies of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained PaliGemma, other VLMs, and state-of-the-art methods.
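Per-attribute accuracies like those reported above can be computed from multi-task predictions as in the following minimal sketch. The dictionary-per-sample data layout and the function name `per_attribute_accuracy` are assumptions for illustration; the paper's actual evaluation code is not described in this summary.

```python
from collections import Counter


def per_attribute_accuracy(predictions, ground_truth):
    """Return accuracy per attribute for multi-task predictions.

    Both arguments are lists of dicts, one dict per image, mapping an
    attribute name (e.g. "gender") to a label. A missing or None
    prediction counts as an error.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth differ in length")
    correct, total = Counter(), Counter()
    for pred, truth in zip(predictions, ground_truth):
        for attribute, label in truth.items():
            total[attribute] += 1
            if pred.get(attribute) == label:
                correct[attribute] += 1
    return {attribute: correct[attribute] / total[attribute] for attribute in total}


# Toy data, not results from the paper.
truth = [
    {"gender": "Male", "emotion": "happy"},
    {"gender": "Female", "emotion": "sad"},
]
preds = [
    {"gender": "Male", "emotion": "sad"},
    {"gender": "Female", "emotion": "sad"},
]
print(per_attribute_accuracy(preds, truth))  # {'gender': 1.0, 'emotion': 0.5}
```

Reporting one accuracy per attribute, rather than a single pooled number, matches how the results above are stated and makes it visible when one task (here, emotion) is much harder than the others.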
### Conclusion:
By introducing Vision Language Models, this research improves the accuracy and efficiency of facial attribute recognition, especially in multi-task and zero-shot classification. The authors also emphasize the importance of considering ethical issues in facial attribute recognition to ensure the fairness and transparency of the model.