Abstract: This paper presents our submission to the Expression Classification Challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) Competition. Our method combines multimodal features extracted by several different pre-trained models to capture more effective emotional information. For these combinations of visual and audio features, we use two temporal encoders to explore the temporal contextual information in the data. In addition, we employ several ensemble strategies across different experimental settings to obtain the most accurate expression recognition results. Our system achieves an average F1 score of 0.45774 on the validation set.

What problem does this paper attempt to address?

The paper attempts to address the problem of multimodal recognition of facial expressions in the wild (i.e., in natural scenes). Specifically, the researchers participated in the 5th Affective Behavior Analysis in the Wild (ABAW) competition's expression classification challenge. They proposed a method that utilizes multiple pre-trained models to extract visual and audio features and captures temporal context information in the data through different temporal encoders. Additionally, they employed various ensemble strategies to achieve the most accurate expression recognition results.

### Main Contributions of the Paper:

1. **Multimodal Feature Combination**: Combined visual and audio features extracted from multiple pre-trained models to capture more effective expression information.
2. **Temporal Encoders**: Used LSTM and Transformer as two temporal encoders to explore the temporal context information in the data.
3. **Ensemble Strategies**: Adopted various ensemble strategies to improve the accuracy of expression recognition.

### Experimental Results:

- On the validation set, the method achieved an average F1 score of 0.45774.
- Through different feature combinations and model structures, the experimental results show that multimodal features and ensemble strategies can significantly enhance recognition performance.

### Summary:

The paper aims to improve the accuracy of facial expression recognition in natural scenes through multimodal feature extraction and temporal context modeling. By using multiple pre-trained models and ensemble strategies, the researchers achieved good results in the expression classification challenge.
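
### Illustrative Sketch:

The ensemble step and the average F1 metric mentioned above can be illustrated with a minimal sketch. It assumes soft voting (averaging per-frame class probabilities across models) as the ensemble strategy and an unweighted mean of per-class F1 scores as the metric. The summary does not say which ensemble strategies the authors used, so the function names `soft_vote` and `macro_f1` and the toy data are hypothetical and are not the paper's implementation.

```python
from typing import List, Sequence


def soft_vote(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average class probabilities across models, then pick the argmax per frame.

    model_probs[m][t][c] is model m's probability for class c at frame t.
    (Assumed ensemble strategy; the paper may use a different one.)
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    preds = []
    for t in range(n_frames):
        # Mean probability of each class over all models at frame t.
        mean = [
            sum(model_probs[m][t][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=mean.__getitem__))
    return preds


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores (a class with no support or predictions scores 0)."""
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes


# Toy example: two hypothetical models, three frames, three classes.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
model_b = [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
labels = [0, 1, 2]

ensemble_preds = soft_vote([model_a, model_b])
print(ensemble_preds, round(macro_f1(labels, ensemble_preds, 3), 4))
```

Soft voting is only one option. Majority voting over hard predictions, or weighting each model by its validation score, would fit the same interface by changing how the per-frame scores are combined.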

Multi-modal Expression Recognition with Ensemble Method

Facial Expression Recognition Based on Multi-modal Features for Videos in the Wild

Emotion Recognition in Videos via Fusing Multimodal Features.

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Facial Affect Recognition based on Transformer Encoder and Audiovisual Fusion for the ABAW5 Challenge

An Effective Ensemble Learning Framework for Affective Behaviour Analysis

Transformer-based Multimodal Information Fusion for Facial Expression Analysis

Multi-Task Learning Framework for Emotion Recognition In-the-Wild

Valence and Arousal Estimation Based on Multimodal Temporal-Aware Features for Videos in the Wild

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Multi-modal Facial Action Unit Detection with Large Pre-trained Models for the 5th Competition on Affective Behavior Analysis in-the-wild

Emotion recognition with multimodal features and temporal models.

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition

Combining Multimodal Features Within A Fusion Network For Emotion Recognition In The Wild

Multi-model Ensemble Learning Method for Human Expression Recognition

Multi-modal Facial Affective Analysis based on Masked Autoencoder

Multi-View Common Space Learning For Emotion Recognition In The Wild

Ensemble System for Multimodal Emotion Recognition Challenge (MEC 2017)

Mutilmodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos

Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

Multimodal Facial Expression Recognition Based on Dempster-Shafer Theory Fusion Strategy