Abstract: Compound Expression Recognition (CER) is vital for effective interpersonal interactions. Human emotional expressions are inherently complex due to the presence of compound expressions, requiring the consideration of both local and global facial cues for accurate judgment. In this paper, we propose an ensemble learning-based solution to address this complexity. Our approach involves training three distinct expression classification models using convolutional networks, Vision Transformers, and multiscale local attention networks. By employing late fusion for model ensemble, we combine the outputs of these models to predict the final results. Our method demonstrates high accuracy on the RAF-DB dataset and is capable of recognizing expressions in certain portions of the C-EXPR-DB through zero-shot learning.
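The late-fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact fusion rule, so the weighted average of per-model class probabilities used here is an assumption, as are the function name `late_fusion`, the equal default weights, and the toy three-class probability vectors (the paper works with seven compound expressions).

```python
def late_fusion(model_probs, weights=None):
    """Weighted average of per-model class probabilities for one image.

    model_probs: list of probability vectors, one per model.
    weights: optional per-model weights; equal weights if omitted (assumption).
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(model_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(n_classes)
    ]
    total = sum(fused)  # renormalise so the result is a distribution
    return [f / total for f in fused]


# Hypothetical outputs of the three models on one image (3 toy classes).
cnn_probs = [0.6, 0.3, 0.1]
vit_probs = [0.2, 0.5, 0.3]
local_attention_probs = [0.3, 0.4, 0.3]

fused = late_fusion([cnn_probs, vit_probs, local_attention_probs])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the CNN alone would pick class 0, while the fused prediction follows the majority evidence from the other two models. Passing `weights` would let a stronger model count for more.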

What problem does this paper attempt to address?

This paper addresses Compound Expression Recognition (CER). Traditional facial expression recognition is usually limited to classifying six basic expressions (anger, happiness, sadness, surprise, disgust, and fear). In real life, however, human expressions are more complex than these predefined categories and often combine two or more basic emotions, for example surprise mixed with fear, or a "bittersweet" blend of happiness and sadness. To meet this challenge, the paper proposes a multi-model ensemble learning solution that aims to recognize compound expressions more accurately by combining local and global facial cues. The method trains three expression classification models: a convolutional neural network (CNN), a Vision Transformer, and a Multiscale Local Attention Network. A late-fusion strategy then combines the outputs of these models to predict the final result.

The specific methods and technical details proposed in the paper are as follows:

1. **Feature Extraction**:
   - Use ResNet50 as the convolutional neural network, focusing on capturing local features of facial expressions.
   - Use ViT (Vision Transformer) to extract image features, capturing global information about the facial expression through the self-attention mechanism.
   - Use a multi-layer perceptron (MLP) to fuse the features extracted by the two models above, exploiting the complementarity of local and global features.

2. **Dataset and Experimental Setup**:
   - Use two commonly used expression recognition datasets, RAF-DB and C-EXPR-DB, for training and evaluation (the abstract describes the C-EXPR-DB results as zero-shot).
   - The evaluation metric is the F1 score, computed over the seven compound expression classes.

3. **Model Integration**:
   - Take a batch of images \( x \) as input, where \( x \in R^{B\times3\times H\times W} \), \( B \) is the batch size, 3 is the number of RGB channels, and \( H \) and \( W \) are the height and width of the image.
   - The extracted features are:
     \[ \text{feature}_1 = \text{PosterV2}(x) \in R^{B\times768} \]
     \[ \text{feature}_2 = \text{ResNet}(x) \in R^{B\times2048} \]
   - Concatenate these features, feed them into an MLP, and apply the softmax function to obtain the prediction over the seven compound expressions (the softmax output is a probability distribution, although the formula names it \( \text{logit} \)):
     \[ \text{feature} = [\text{feature}_1;\text{feature}_2] \]
     \[ \text{logit} = \text{softmax}(\text{MLP}(\text{feature})) \]

Through this multi-model fusion, the authors aim to improve the accuracy and robustness of compound expression recognition. The experimental results show that the ensemble outperforms the single models on multiple compound expressions, with a significant improvement on some difficult-to-recognize expressions.
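The Model Integration formulas above can be sketched in plain Python. This is only an illustration under stated assumptions: the backbone outputs are replaced by random stand-in vectors of the stated sizes (768 and 2048), the MLP is assumed to have one hidden layer of 32 units with ReLU, and the weights are random and untrained. None of these choices, nor the helper names `init_layer` and `fuse_and_classify`, come from the paper.

```python
import math
import random

NUM_CLASSES = 7  # seven compound expressions, per the summary


def softmax(scores):
    """Numerically stable softmax over one list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder)."""
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(feature_1, feature_2, params):
    """Per sample: concatenate features, apply a two-layer MLP, then softmax."""
    (w1, b1), (w2, b2) = params
    probs = []
    for f1, f2 in zip(feature_1, feature_2):
        feature = f1 + f2  # list concatenation, i.e. [feature_1; feature_2]
        hidden = [max(0.0, h) for h in linear(feature, w1, b1)]  # ReLU
        probs.append(softmax(linear(hidden, w2, b2)))
    return probs


rng = random.Random(0)
B, D1, D2, HIDDEN = 2, 768, 2048, 32  # HIDDEN is an assumed size

# Random stand-ins for PosterV2(x) and ResNet(x); real features would
# come from the trained backbones.
feature_1 = [[rng.gauss(0, 1) for _ in range(D1)] for _ in range(B)]
feature_2 = [[rng.gauss(0, 1) for _ in range(D2)] for _ in range(B)]

params = (init_layer(D1 + D2, HIDDEN, rng),
          init_layer(HIDDEN, NUM_CLASSES, rng))
probs = fuse_and_classify(feature_1, feature_2, params)
predictions = [max(range(NUM_CLASSES), key=row.__getitem__) for row in probs]
```

Each row of `probs` is a distribution over the seven classes, and `predictions` holds the arg-max class index per image. In practice a tensor library would replace the list arithmetic; the pure-Python version only keeps the sketch dependency-free.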
