SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Siyang Song, Muhammad Haris Khan, Weicheng Xie, Linlin Shen, Zongyuan Ge
2024-10-13
Abstract: Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Experiment results validate the efficacy of the proposed approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the original size. Our code will be made publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limited scale and insufficient quality of existing datasets for Facial Expression Recognition (FER). Specifically:

1. **Dataset scale and quality**: Existing facial expression datasets (such as CK+, FER-2013, RAF-DB, AFEW, and SFEW) are small compared with general image-processing datasets such as ImageNet. Larger datasets such as AffectNet contain many facial images but suffer from low image quality and inaccurate labels.
2. **Requirements of deep learning models**: Modern deep learning-based facial expression analysis models, especially foundation models, rely on large-scale, high-quality data to achieve optimal performance. Existing FER datasets cannot meet these requirements, which limits model development and performance.

To address these problems, the authors propose SynFER (Synthesis of Facial Expressions with Refined Control), a framework that generates facial expression images from high-level text descriptions combined with finer-grained control through Facial Action Units (FAUs). The synthetic data is intended to increase both the quantity and the quality of the training data available to FER models.

### Main contributions of SynFER:

1. **FEText dataset**: A new mixed image-text dataset built specifically for facial expression tasks, containing 400,000 carefully curated image-text pairs.
2. **SynFER framework**: Proposed as the first diffusion-model-based data synthesis pipeline for FER, it combines FAU information with semantic guidance to achieve fine-grained control and realistic facial expression generation.
3. **FERAnno label calibrator**: A diffusion-model-based label calibrator that automatically generates reliable annotations for synthetic facial expression images.

Through these contributions, SynFER can generate large-scale, high-quality facial expression data that improves the performance of FER models under various learning paradigms. Experimental results show that a model trained solely on synthetic data equal in size to the AffectNet training set reaches 67.23% classification accuracy on AffectNet, and that scaling the synthetic data to five times that size raises accuracy to 69.84%.
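To make the idea of label rectification concrete, here is a minimal sketch in Python. The summary above does not describe how FERAnno decides when to change a label, so the rule below is an assumption chosen for illustration and is not the authors' method: a generic confidence-threshold check that replaces the label requested at generation time only when a separate classifier confidently predicts a different expression. The names `rectify_label`, `EXPRESSIONS`, and the `threshold` value are all hypothetical.

```python
from typing import Sequence

# Hypothetical class list for illustration; the paper's label set may differ.
EXPRESSIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]


def rectify_label(
    prompt_label: int,
    predicted_probs: Sequence[float],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> int:
    """Return the label to keep for one synthetic image.

    prompt_label    -- index of the expression requested when generating the image
    predicted_probs -- class probabilities from some pseudo-label model
    """
    # Index of the class the pseudo-label model considers most likely.
    best = max(range(len(predicted_probs)), key=lambda i: predicted_probs[i])

    # Override the requested label only on a confident disagreement.
    if best != prompt_label and predicted_probs[best] >= threshold:
        return best
    return prompt_label


# The image was generated as "happy", but the pseudo-labeler is confident it looks "sad".
probs = [0.05, 0.05, 0.85, 0.05, 0.0, 0.0, 0.0]
print(EXPRESSIONS[rectify_label(EXPRESSIONS.index("happy"), probs)])
```

A threshold keeps the requested label whenever the pseudo-labeler is unsure, which avoids swapping one noisy label for another. A real calibrator could instead blend the two sources or discard ambiguous images.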