Compound Expression Recognition via Multi Model Ensemble for the ABAW7 Challenge

Xuxiong Liu,Kang Shen,Jun Yao,Boyan Wang,Minrui Liu,Liuwei An,Zishun Cui,Weijie Feng,Xiao Sun
2024-07-26
Abstract:Compound Expression Recognition (CER) is vital for effective interpersonal interactions. Human emotional expressions are inherently complex due to the presence of compound expressions, requiring the consideration of both local and global facial cues for accurate judgment. In this paper, we propose an ensemble learning-based solution to address this complexity. Our approach involves training three distinct expression classification models using convolutional networks, Vision Transformers, and multiscale local attention networks. By employing late fusion for model ensemble, we combine the outputs of these models to predict the final results. Our method demonstrates high accuracy on the RAF-DB datasets and is capable of recognizing expressions in certain portions of the C-EXPR-DB through zero-shot learning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the recognition problem of compound emotional expression (Compound Expression Recognition, CER). Specifically, traditional emotional expression recognition techniques are usually limited to classifying six basic facial expressions (such as anger, happiness, sadness, surprise, disgust, and fear). However, in real life, human emotional expressions are far more complex than these predefined categories and often contain combinations of two or more basic emotions, such as "terror", "surprise", "bittersweet", etc. To meet this challenge, this paper proposes a solution based on multi - model ensemble learning, aiming to more accurately recognize compound emotional expressions by combining local and global facial cues. This method involves training three different expression classification models: convolutional neural network (CNN), Vision Transformer, and Multiscale Local Attention Networks. By adopting a late - fusion strategy, the outputs of these models are combined to predict the final result. The following are the specific methods and technical details proposed in the paper: 1. **Feature Extraction**: - Use ResNet50 as the convolutional neural network model, focusing on capturing local features of facial expressions. - Use ViT (Vision Transformer) to extract features from images and effectively capture global information of facial expressions through the self - attention mechanism. - Use a multi - layer perceptron (MLP) to fuse the features extracted by the above two models, taking advantage of the complementarity of local and global features. 2. **Dataset and Experimental Setup**: - Use two commonly used expression recognition datasets, RAF - DB and C - EXPR - DB, for training and validating the model. - The evaluation metric is the F1 score, which is used to measure the prediction accuracy of the model for seven compound expressions. 3. **Model Integration**: - Use a batch of data images \( x \) as input, where \( X\in R^{B\times3\times H\times W} \), \( B \) represents the batch size, 3 represents the RGB channels, and \( H \) and \( W \) represent the height and width of the image respectively. - The extracted features are represented as: \[ \text{feature}_1=\text{PosterV2}(x)\in R^{B\times768} \] \[ \text{feature}_2 = \text{ResNet}(x)\in R^{B\times2048} \] - Concatenate these features and input them into a multi - layer perceptron (MLP), and finally apply the softmax function to calculate the logits of seven compound expressions: \[ \text{feature}=[\text{feature}_1;\text{feature}_2] \] \[ \text{logit}=\text{softmax}(\text{MLP}(\text{feature})) \] Through this multi - model fusion method, the author aims to improve the accuracy and robustness of compound emotional expression recognition. The experimental results show that the integrated model performs better than a single model in the recognition of multiple compound expressions, especially achieving a significant improvement in some difficult - to - recognize expressions.