Abstract: In this paper, we present our approach to addressing the challenges of the 7th ABAW competition. The competition comprises three sub-challenges: Valence-Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these challenges, we employ state-of-the-art models to extract powerful visual features. Subsequently, a Transformer Encoder is utilized to integrate these features for the VA, Expr, and AU sub-challenges. To mitigate the impact of varying feature dimensions, we introduce an affine module to align the features to a common dimension. Overall, our results significantly outperform the baselines.

What problem does this paper attempt to address?

This paper addresses three challenges in Facial Affect Recognition, drawn from the 7th ABAW (Affective Behavior Analysis in-the-Wild) competition. The three sub-challenges are:

1. **Valence-Arousal (VA) Estimation**: Accurately determine how positive or negative an emotion is (valence) and its activation level (arousal).
2. **Expression (Expr) Classification**: Recognize the six basic expressions (anger, disgust, fear, happiness, sadness, surprise) as well as the neutral expression.
3. **Action Unit (AU) Detection**: Analyze facial muscle movements to capture subtle facial gestures and decode complex emotional expressions.

To solve these problems, the authors use advanced deep-learning models to extract powerful visual features and a Transformer encoder to fuse these features for the VA, Expr, and AU sub-challenges. In addition, to mitigate the impact of differing feature dimensions, they introduce an affine module that aligns the features to a common dimension.

### Overview of Specific Methods

1. **Feature Extraction**:
   - Use a pre-trained ResNet-18 model to extract 512-dimensional visual features from images.
   - Use the POSTER and POSTER2 networks to extract 768-dimensional visual features from videos.
   - Use the OpenFace framework to extract 17-dimensional facial action unit (FAU) features.
2. **Feature Alignment**:
   - Design an affine module that maps features of different dimensions to a unified dimension through a linear layer, and add positional encoding (PE) to convey temporal context.
3. **Feature Fusion and Encoding**:
   - Concatenate the aligned features and feed them into the Transformer encoder to model temporal relationships.
   - Pass the output of the Transformer encoder to the output layer to obtain the final prediction.
4. **Loss Function**:
   - For VA estimation, use the mean squared error (MSE) and the Concordance Correlation Coefficient (CCC) loss.
   - For expression classification, use the cross-entropy loss.
   - For AU detection, use the weighted asymmetric loss.

### Main Contributions

1. **Efficient Expression Feature Extractor**: By optimizing on large-scale facial expression datasets, the authors construct an efficient facial expression feature extractor.
2. **Multi-modal Fusion Model**: A Transformer-based multi-modal fusion model is introduced to exploit the complementarity between different modalities.
3. **Ensemble Learning Strategy**: An ensemble learning strategy is adopted to improve the accuracy and generalization of emotion analysis across different scenarios.

In conclusion, the method proposed in this paper significantly outperforms the baseline models, demonstrating its effectiveness in the field of facial affect recognition.
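The feature alignment and fusion steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The sequence length, the common dimension of 64, the random (untrained) projection weights, the sinusoidal form of the positional encoding, and concatenation along the feature axis are all assumptions made for illustration. Only the input dimensions (512, 768, 17) come from the summary.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Sinusoidal positional encoding of shape (seq_len, dim); dim must be even.

    Assumption: the summary only says "position encoding", not which variant.
    """
    positions = np.arange(seq_len)[:, None]                         # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe


def affine_align(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project (seq_len, in_dim) features to (seq_len, common_dim), then add PE."""
    projected = features @ weight + bias
    return projected + sinusoidal_positional_encoding(*projected.shape)


rng = np.random.default_rng(0)
seq_len, common_dim = 16, 64  # hypothetical values, not from the paper

# Feature dimensions as listed in the summary.
input_dims = {"resnet18": 512, "poster": 768, "openface_au": 17}

aligned = []
for name, in_dim in input_dims.items():
    feats = rng.standard_normal((seq_len, in_dim))                    # stand-in for extracted features
    w = rng.standard_normal((in_dim, common_dim)) / np.sqrt(in_dim)   # untrained linear layer
    b = np.zeros(common_dim)
    aligned.append(affine_align(feats, w, b))

# Assumption: the aligned streams are concatenated along the feature axis
# before being passed to a Transformer encoder (not shown here).
fused = np.concatenate(aligned, axis=-1)
print(fused.shape)  # (16, 192)
```

After alignment every stream has the same width, so the 17-dimensional OpenFace features are not dwarfed by the 768-dimensional POSTER features simply because of their size.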
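For the VA loss mentioned above, the Concordance Correlation Coefficient has a standard closed form, $\mathrm{CCC} = \frac{2\,\mathrm{cov}(x, y)}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, and the usual loss is $1 - \mathrm{CCC}$. The sketch below shows one way to combine it with MSE. The summary does not say how the two terms are weighted, so `mse_weight`, `ccc_weight`, and the small `eps` stabilizer are hypothetical parameters.

```python
import numpy as np


def ccc(pred, target, eps: float = 1e-8) -> float:
    """Concordance Correlation Coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()              # population variances
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    # eps (an assumption) avoids division by zero when both inputs are constant and equal.
    return float(2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps))


def va_loss(pred, target, mse_weight: float = 1.0, ccc_weight: float = 1.0) -> float:
    """Weighted sum of MSE and (1 - CCC); the weighting scheme is hypothetical."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    return mse_weight * mse + ccc_weight * (1.0 - ccc(pred, target))


valence_true = [0.1, 0.4, -0.2, 0.8]
valence_pred = [0.0, 0.5, -0.1, 0.6]
print(ccc(valence_pred, valence_true))
print(va_loss(valence_pred, valence_true))
```

Unlike MSE alone, the CCC term penalizes predictions that track the shape of the target but are shifted or rescaled, which is why it is commonly paired with MSE for valence and arousal regression.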

Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge
