Multi-modal Expression Recognition with Ensemble Method

Chuanhe Liu, Xinjie Zhang, Xiaolong Liu, Tenggan Zhang, Liyu Meng, Yuchen Liu, Yuanyuan Deng, Wenqiang Jiang
2023-03-17
Abstract: This paper presents our submission to the Expression Classification Challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) Competition. In our method, multimodal feature combinations extracted by several different pre-trained models are applied to capture more effective emotional information. For these combinations of visual and audio modal features, we utilize two temporal encoders to explore the temporal contextual information in the data. In addition, we employ several ensemble strategies for different experimental settings to obtain the most accurate expression recognition results. Our system achieves an average F1 score of 0.45774 on the validation set.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the problem of multimodal recognition of facial expressions in the wild (i.e., in natural scenes). Specifically, the researchers participated in the 5th Affective Behavior Analysis in the Wild (ABAW) competition's expression classification challenge. They proposed a method that utilizes multiple pre-trained models to extract visual and audio features and captures temporal context information in the data through different temporal encoders. Additionally, they employed various ensemble strategies to achieve the most accurate expression recognition results.

### Main Contributions of the Paper:

1. **Multimodal Feature Combination**: Combined visual and audio features extracted from multiple pre-trained models to capture more effective expression information.
2. **Temporal Encoders**: Used LSTM and Transformer as two temporal encoders to explore the temporal context information in the data.
3. **Ensemble Strategies**: Adopted various ensemble strategies to improve the accuracy of expression recognition.

### Experimental Results:

- On the validation set, the method achieved an average F1 score of 0.45774.
- Through different feature combinations and model structures, the experimental results show that multimodal features and ensemble strategies can significantly enhance recognition performance.

### Summary:

The paper aims to improve the accuracy of facial expression recognition in natural scenes through multimodal feature extraction and temporal context modeling. By using multiple pre-trained models and ensemble strategies, the researchers achieved good results in the expression classification challenge.
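
The text above does not say which ensemble strategies were used, so the following is only a minimal sketch of one common option, soft voting, applied to the per-frame class probabilities of two models (for instance an LSTM-based and a Transformer-based one). It also assumes that "average F1" means the unweighted mean of per-class F1 scores (macro F1). The function names `soft_vote` and `macro_f1`, the three-class setup, and all numbers are hypothetical toy values, not the paper's data or results.

```python
def soft_vote(prob_sets, weights=None):
    """Fuse per-frame class probabilities from several models by weighted averaging.

    prob_sets: list of models, each a list of frames, each a list of class probabilities.
    Returns the index of the highest fused probability for every frame.
    """
    n_models = len(prob_sets)
    if weights is None:
        # Assumption: equal weights when none are given.
        weights = [1.0 / n_models] * n_models
    n_frames = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    preds = []
    for t in range(n_frames):
        fused = [
            sum(w * probs[t][c] for w, probs in zip(weights, prob_sets))
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=fused.__getitem__))
    return preds


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (a class with no support scores 0)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Hypothetical toy outputs for four frames and three classes.
lstm_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.4, 0.35, 0.25]]
transformer_probs = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6], [0.2, 0.6, 0.2]]
labels = [0, 1, 2, 1]

preds = soft_vote([lstm_probs, transformer_probs])
score = macro_f1(labels, preds, n_classes=3)
print(preds, round(score, 4))
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one on a given frame; passing explicit `weights` would be one way to favor the stronger model on a validation set.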