Abstract: Technologies for recognizing facial attributes like race, gender, age, and emotion have several applications, such as surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics and facial expressions from images poses several challenges due to the complexity of human facial attributes. Traditional approaches have employed CNNs and various other deep learning techniques, trained on extensive collections of labeled images. While these methods have demonstrated effective performance, there remains potential for further enhancement. In this paper, we propose to utilize vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images with human faces. Various datasets like FairFace, AffectNet, and UTKFace have been utilized to evaluate the solutions. The results show that VLMs are competitive with, if not superior to, traditional techniques. Additionally, we propose "FaceScanPaliGemma", a fine-tuned PaliGemma model, for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose "FaceScanGPT", a GPT-4o model that recognizes the above attributes when several individuals are present in the image, using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT to detect an individual's attributes, such as haircut, clothing color, and posture, using only a prompt to drive the detection and recognition tasks.
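The accuracies quoted in the abstract are reported separately for each attribute. As a minimal sketch of how such per-attribute scores could be computed, assuming plain classification accuracy over a labeled test set: the function name `per_attribute_accuracy`, the dict layout, and the toy samples below are hypothetical and are not taken from the paper's evaluation code.

```python
from collections import defaultdict


def per_attribute_accuracy(predictions, ground_truth):
    """Return {attribute: fraction of samples predicted correctly}.

    Both arguments are lists of dicts keyed by attribute name.
    Attributes whose ground-truth label is None are skipped (unlabeled);
    a missing prediction for a labeled attribute counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, actual in zip(predictions, ground_truth):
        for attribute, label in actual.items():
            if label is None:
                continue  # no ground truth for this attribute
            total[attribute] += 1
            if predicted.get(attribute) == label:
                correct[attribute] += 1
    return {attribute: correct[attribute] / total[attribute] for attribute in total}


# Toy data, invented purely for illustration.
truth = [
    {"gender": "Female", "emotion": "Happy"},
    {"gender": "Male", "emotion": "Sad"},
]
preds = [
    {"gender": "Female", "emotion": "Happy"},
    {"gender": "Male", "emotion": "Neutral"},
]

print(per_attribute_accuracy(preds, truth))
# {'gender': 1.0, 'emotion': 0.5}
```

Scoring each attribute on its own makes it visible when one task lags the others, as emotion does in the figures above.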

What problem does this paper attempt to address?

The main problem this paper attempts to solve is improving the accuracy and efficiency of facial attribute recognition (race, gender, age, and emotion). Specifically, the authors use Vision Language Models (VLMs) to recognize multiple facial attributes simultaneously and to overcome the limitations of traditional methods on complex and diverse facial data.

### Main problems:

1. **Multi-task classification challenges**: Traditional facial attribute recognition usually trains a separate model for each attribute (race, gender, age, emotion), which increases the demand for computing resources and may lead to redundant learning.
2. **Dataset bias**: Existing facial attribute datasets may suffer from sample imbalance, pose variation, and differences in lighting conditions, which degrades model performance in practical applications.
3. **Model performance improvement**: Although Convolutional Neural Networks (CNNs) and other deep learning techniques have achieved good results in facial attribute recognition, there is still room for improvement, especially in multi-task classification and zero-shot learning.
4. **Ethical issues**: Facial attribute recognition involves sensitive human attributes (such as race and gender), so the model must be fair, transparent, and interpretable, and must avoid bias and discrimination.

### Solutions:

To address these challenges, the authors propose the following:

1. **Using Vision Language Models (VLMs)**: The authors evaluate several advanced VLMs, namely GPT, GEMINI, LLAVA, PaliGemma, and Florence-2. These models have strong multimodal capabilities and can process image and text information together.
2. **Multi-task learning**: A single model recognizes multiple facial attributes at once, which reduces the need for several independent models and improves computational efficiency.
3. **Zero-shot classification**: The authors evaluate VLMs in the zero-shot setting, that is, classification without task-specific training examples, to assess how well the models generalize.
4. **Evaluation on multiple datasets**: Experiments use the public datasets FairFace, AffectNet, and UTKFace, so that the models are tested on diverse and complex facial data.
5. **Model fine-tuning**: By fine-tuning PaliGemma, the authors develop a new model named "FaceScanPaliGemma", which achieves higher accuracy in race, gender, age, and emotion classification.
6. **Multi-task processing ability**: The authors propose "FaceScanGPT", a model based on GPT-4o that recognizes facial and physical attributes of a specific person in images containing multiple individuals, driven only by a prompt.

### Experimental results:

The experimental results show that the evaluated VLMs perform well in facial attribute recognition and are competitive with, if not superior to, traditional methods. Specifically, "FaceScanPaliGemma" reaches accuracies of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained PaliGemma, other VLMs, and SotA methods.

### Conclusion:

By introducing Vision Language Models, this research improves the accuracy and efficiency of facial attribute recognition, especially in multi-task and zero-shot classification. In addition, the authors emphasize the importance of considering ethical issues in facial attribute recognition to ensure the fairness and transparency of the model.
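The single-model, multi-attribute idea summarized above can be illustrated with a short sketch: one prompt asks for all four attributes, and the free-text reply is parsed and checked against fixed label sets. This is not the authors' prompt or code. The label sets are an assumption loosely modelled on FairFace and AffectNet categories, the helper names `build_prompt` and `parse_response` are hypothetical, and the call to an actual VLM is deliberately left out and replaced by a made-up reply string.

```python
import json

# Illustrative label sets (assumption): loosely modelled on FairFace and
# AffectNet categories, not the exact sets used in the paper.
LABELS = {
    "race": ["White", "Black", "Latino_Hispanic", "East Asian",
             "Southeast Asian", "Indian", "Middle Eastern"],
    "gender": ["Male", "Female"],
    "age_group": ["0-2", "3-9", "10-19", "20-29", "30-39",
                  "40-49", "50-59", "60-69", "70+"],
    "emotion": ["Neutral", "Happy", "Sad", "Surprise",
                "Fear", "Disgust", "Anger", "Contempt"],
}


def build_prompt(labels=LABELS):
    """Build one instruction that asks for every attribute at once."""
    lines = [
        "Look at the face in the image and answer with a JSON object.",
        "Use exactly one of the allowed values for each key:",
    ]
    for attribute, options in labels.items():
        lines.append(f'- "{attribute}": {", ".join(options)}')
    lines.append("Return only the JSON object.")
    return "\n".join(lines)


def parse_response(text, labels=LABELS):
    """Extract the JSON object from a model reply and validate each value.

    Attributes that are missing or outside the allowed set become None,
    so they can be counted as errors during evaluation.
    """
    empty = {attribute: None for attribute in labels}
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        raw = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return empty
    result = {}
    for attribute, options in labels.items():
        value = str(raw.get(attribute, "")).strip().lower()
        matches = [option for option in options if option.lower() == value]
        result[attribute] = matches[0] if matches else None
    return result


# Made-up reply standing in for a real VLM response.
mock_reply = ('Sure! {"race": "indian", "gender": "Female", '
              '"age_group": "20-29", "emotion": "joyful"}')
print(build_prompt())
print(parse_response(mock_reply))
```

Validating the reply against a closed label set is one way to turn a generative model's text into a classification decision; here the out-of-vocabulary value "joyful" is mapped to `None` rather than guessed.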

Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age
