Abstract: With the proliferation of user-generated videos on websites, automatic perception and understanding of human emotion and sentiment in these videos has become particularly important. In this paper, we present our solutions to the MuSe-Wilder and MuSe-Sent sub-challenges of the MuSe 2021 Multimodal Sentiment Analysis Challenge. MuSe-Wilder focuses on continuous emotion (i.e., arousal and valence) recognition, while MuSe-Sent concentrates on discrete sentiment classification. To this end, we first extract a variety of features from three common modalities (i.e., audio, visual, and text), including both low-level handcrafted features and high-level deep representations from supervised and unsupervised pre-trained models. Then, a long short-term memory recurrent neural network and a self-attention mechanism are employed to model the complex temporal dependencies in the feature sequence. The concordance correlation coefficient (CCC) loss and the F1-loss are used to guide continuous regression and discrete classification, respectively. To further boost the model's performance, we adopt late fusion to exploit complementary information from the different modalities. Our proposed method achieves CCCs of 0.4117 and 0.6649 for arousal and valence, respectively, on the test set of MuSe-Wilder, outperforming the baseline system (0.3386 and 0.5974) by a large margin. For MuSe-Sent, we obtain F1-scores of 0.3614 and 0.4451 for arousal and valence, which also outperform the baseline system significantly (0.3512 and 0.3291). With these promising results, we ranked in the top 3 in both sub-challenges.
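The two training objectives named in the abstract can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the authors' implementation: the function names (`ccc`, `ccc_loss`, `soft_f1_loss`) are hypothetical, the CCC uses the standard definition 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) with population statistics, and the F1-loss is assumed to be a differentiable "soft" macro-F1 computed from class probabilities, which is one common formulation and may differ from the one used in the paper.

```python
from typing import Sequence


def ccc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (divide-by-n) variance and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical. Treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Regression loss: 0 at perfect concordance, up to 2 at perfect discordance."""
    return 1.0 - ccc(pred, gold)


def soft_f1_loss(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    eps: float = 1e-8,
) -> float:
    """Assumed soft macro-F1 loss: 1 minus the mean per-class soft F1.

    probs[i][c] is the predicted probability of class c for sample i;
    labels[i] is the integer gold class of sample i.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")
    num_classes = len(probs[0])
    f1_sum = 0.0
    for c in range(num_classes):
        tp = fp = fn = 0.0
        for row, y in zip(probs, labels):
            is_c = 1.0 if y == c else 0.0
            tp += row[c] * is_c                # probability mass on the correct class
            fp += row[c] * (1.0 - is_c)        # mass wrongly assigned to class c
            fn += (1.0 - row[c]) * is_c        # mass missing from class c
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes
```

In a real training setup both functions would be expressed with tensor operations so that gradients can flow through them; the arithmetic stays the same.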

Emotion Recognition Using Multimodal Features

Emotion Recognition in Videos via Fusing Multimodal Features.

Emotion recognition with multimodal features and temporal models.

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition

Audio Visual Recognition of Spontaneous Emotions In-the-Wild.

Video Emotion Recognition in the Wild Based on Fusion of Multimodal Features

MEC 2016: The Multimodal Emotion Recognition Challenge of CCPR 2016.

Multimodal Facial Expression Recognition Based on Dempster-Shafer Theory Fusion Strategy

Audio-Visual Emotion Recognition with Capsule-like Feature Representation and Model-Based Reinforcement Learning

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips.

Chinese Multimodal Emotion Recognition in Deep and Traditional Machine Learning Approaches

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Audio-Video Based Multimodal Emotion Recognition Using SVMs and Deep Learning

MEC 2017: Multimodal Emotion Recognition Challenge

Multi-modal Emotion Recognition Based on Deep Learning in Speech, Video and Text

Multi-Modal Multi-Cultural Dimensional Continues Emotion Recognition In Dyadic Interactions

Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model

Video Emotion Recognition using Hand-Crafted and Deep Learning Features

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Combining Multimodal Features Within A Fusion Network For Emotion Recognition In The Wild

A robust multimodal approach for emotion recognition