Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

Kang Shen, Xuxiong Liu, Boyan Wang, Jun Yao, Xin Liu, Yujie Guan, Yu Wang, Gengchen Li, Xiao Sun
2024-07-26
Abstract: In this paper, we present our approach to addressing the challenges of the 7th ABAW competition. The competition comprises three sub-challenges: Valence Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these challenges, we employ state-of-the-art models to extract powerful visual features. Subsequently, a Transformer Encoder is utilized to integrate these features for the VA, Expr, and AU sub-challenges. To mitigate the impact of varying feature dimensions, we introduce an affine module to align the features to a common dimension. Overall, our results significantly outperform the baselines.
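The alignment step described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy feature sizes, the common dimension, the random weights, the sinusoidal form of the positional encoding, and the choice to concatenate along the feature axis are all assumptions made for illustration, and the Transformer encoder itself is omitted.

```python
import math
import random

NUM_FRAMES = 5   # hypothetical clip length
COMMON_DIM = 8   # hypothetical common dimension


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a randomly initialised linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_linear(layer, frames):
    """Project every frame (a list of floats) to the layer's output size."""
    weights, bias = layer
    return [[sum(w * x for w, x in zip(row, frame)) + b
             for row, b in zip(weights, bias)]
            for frame in frames]


def positional_encoding(num_frames, dim):
    """Sinusoidal positional encoding (assumed form), one vector per frame."""
    table = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            angle = t / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def align(frames, layer):
    """Affine projection to the common dimension, plus positional encoding."""
    projected = apply_linear(layer, frames)
    pe = positional_encoding(len(projected), len(projected[0]))
    return [[v + p for v, p in zip(frame, enc)]
            for frame, enc in zip(projected, pe)]


def fuse(streams):
    """Concatenate aligned streams frame by frame along the feature axis."""
    return [sum(per_frame, []) for per_frame in zip(*streams)]


rng = random.Random(0)
input_dims = [6, 4, 3]  # toy stand-ins for three extractors' feature sizes
raw = [[[rng.random() for _ in range(d)] for _ in range(NUM_FRAMES)]
       for d in input_dims]
layers = [make_linear(d, COMMON_DIM, rng) for d in input_dims]
fused = fuse([align(frames, layer) for frames, layer in zip(raw, layers)])
print(len(fused), len(fused[0]))
```

After alignment every stream has the same per-frame width, so the fused sequence has a fixed shape regardless of which extractors are combined; in this sketch that sequence is what a Transformer encoder would consume.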
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three facial affect recognition challenges from the 7th ABAW (Affective Behavior Analysis in-the-Wild) competition. The three sub-challenges are:

1. **Valence-Arousal (VA) Estimation**: Estimate how positive or negative an emotion is (valence) and how strongly it is activated (arousal).
2. **Expression (Expr) Classification**: Recognize six basic expressions (anger, disgust, fear, happiness, sadness, surprise) as well as the neutral expression.
3. **Action Unit (AU) Detection**: Analyze facial muscle movements to capture subtle facial gestures and decode complex emotional expressions.

To solve these problems, the authors adopt advanced deep-learning models to extract powerful visual features and use a Transformer encoder to fuse these features for the VA, Expr, and AU sub-challenges. In addition, to mitigate the impact of differing feature dimensions, an affine module is introduced to align the features to a common dimension.

### Overview of Specific Methods

1. **Feature Extraction**:
   - Use the pre-trained ResNet-18 model to extract 512-dimensional visual features from images.
   - Use the POSTER and POSTER2 networks to extract 768-dimensional visual features from videos.
   - Use the OpenFace framework to extract 17-dimensional facial action unit (FAU) features.
2. **Feature Alignment**:
   - Design an affine module that converts features of different dimensions to a unified dimension through a linear layer, and add positional encoding (PE) to convey temporal context.
3. **Feature Fusion and Encoding**:
   - Concatenate the aligned features and feed them into the Transformer encoder to model temporal relationships.
   - Pass the output of the Transformer encoder to the output layer to obtain the final prediction.
4. **Loss Function**:
   - For VA estimation, use the mean squared error (MSE) and the concordance correlation coefficient (CCC) loss.
   - For expression classification, use the cross-entropy loss.
   - For AU detection, use the weighted asymmetric loss.

### Main Contributions

1. **Efficient Expression Feature Extractor**: An efficient facial expression feature extractor is constructed by optimizing on large-scale facial expression datasets.
2. **Multi-modal Fusion Model**: A Transformer-based multi-modal fusion model is introduced, which promotes complementarity and fusion between data from different modalities.
3. **Ensemble Learning Strategy**: An ensemble learning strategy is adopted to improve the accuracy and generalization ability of emotion analysis across different scenarios.

In conclusion, the method proposed in this paper significantly outperforms the baseline model on multiple benchmark datasets, demonstrating its innovation and effectiveness in the field of facial emotion recognition.
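The two loss terms named for valence-arousal estimation can be made concrete with a short sketch. It uses the standard definition of the concordance correlation coefficient with population (1/n) statistics; the function names are hypothetical, returning 1.0 for two identical constant sequences is a convention chosen here, and how the MSE and CCC terms are weighted against each other is not stated in the source, so they are left as separate functions.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def ccc(predictions, targets):
    """Concordance correlation coefficient, in the range [-1, 1]."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t)
              for p, t in zip(predictions, targets)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal: treated as perfect agreement
        # (a convention assumed for this sketch).
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(predictions, targets):
    """Loss that is 0 at perfect concordance and grows as agreement drops."""
    return 1.0 - ccc(predictions, targets)


valence_pred = [0.1, 0.4, 0.8]
valence_true = [0.2, 0.5, 0.9]
print(mse(valence_pred, valence_true), ccc_loss(valence_pred, valence_true))
```

Unlike plain correlation, CCC penalizes a constant offset between prediction and target through the squared mean difference in the denominator, which is why it is commonly paired with or preferred over MSE for continuous affect labels.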