Multi-modal Fusion for Video Sentiment Analysis

Ruichen Li, Jinming Zhao, Jingwen Hu, Shuai Guo, Qin Jin
DOI: https://doi.org/10.1145/3423327.3423671
2020-01-01
Abstract: Automatic sentiment analysis can help reveal a subject's emotional state and opinion tendency toward an entity. In this paper, we present our solutions for the MuSe-Wild sub-challenge of Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020. The videos in this challenge are emotional car reviews collected from YouTube. In these videos, the speaker's sentiment can be conveyed through acoustic, visual, and textual modalities. Because these modalities are complementary, how they are fused has a large impact on sentiment analysis. We highlight two aspects of our solutions: 1) we explore various low-level and high-level features from different modalities for emotional state recognition, such as expert-defined low-level descriptors (LLD) and deep learned features; 2) we propose several effective multi-modal fusion strategies to make full use of the different modalities. Our solutions achieve best concordance correlation coefficient (CCC) scores of 0.4346 on arousal and 0.4513 on valence on the challenge test set, significantly outperforming the baseline system, which scores 0.2843 on arousal and 0.2413 on valence. The experimental results show that the proposed modality representations and fusion strategies generalize well and yield more robust performance.
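The results above are reported as CCC, the concordance correlation coefficient, which rewards predictions that both correlate with the gold annotations and match them in mean and scale. The following is a minimal sketch of the standard CCC formula, not the challenge's official scoring script; the function name `concordance_cc`, the use of population (divide-by-n) statistics, and the return value of 0.0 for a zero denominator are assumptions made for illustration.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    Population (divide-by-n) statistics are used here; this is an assumption.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and equal; CCC is undefined.
        # Returning 0.0 is an illustrative choice, not a standard.
        return 0.0
    return 2 * cov / denom


# Hypothetical per-frame arousal values, invented for illustration only.
gold = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated with gold, but offset by 1

print(concordance_cc(gold, gold))     # identical sequences
print(concordance_cc(shifted, gold))  # offset lowers CCC below 1
```

Unlike Pearson correlation, which would score the shifted sequence as a perfect match, CCC penalizes the constant offset, which is why it suits continuous arousal and valence prediction.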