Extracting Method for Fine-Grained Emotional Features in Videos

Cangzhi Zheng, Junjie Peng, Zesu Cai
DOI: https://doi.org/10.1016/j.knosys.2024.112382
IF: 8.139
2024-01-01
Knowledge-Based Systems
Abstract: Multimodal Sentiment Analysis (MSA) has significant applications in social media analysis and healthcare. It utilizes features from multiple modalities, e.g., video, general text, acoustics, and vision, to obtain more credible sentiment analysis results. However, previous studies have ignored the substantial variations among sub-features within both the acoustic and visual modalities during feature extraction. Instead, they primarily rely on simple concatenation to derive representations. Consequently, these approaches prevent feature extractors from effectively leveraging non-verbal (acoustic and visual) features, thereby constraining the model’s overall performance. To solve this problem, this study proposes a method for extracting fine-grained emotional features from videos. By segregating the initial features of the non-verbal modalities into distinct domains, modeling each domain separately, and then uniformly re-integrating them, our method effectively exploits the modality-specific information in the original features. In extensive experiments, all models perform significantly better with the features extracted by our method than with the original ones. This substantiates that the proposed approach utilizes features from non-verbal modalities more effectively than conventional approaches. It also underscores that processing non-verbal sub-features separately before integration is a viable way to enhance the performance of MSA models.
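The split, model, and re-integrate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the domain names and index ranges (`ACOUSTIC_DOMAINS`), the random linear maps standing in for learned per-domain encoders, and the choice to re-integrate by projecting each domain to a common width and concatenating are all assumptions made for illustration.

```python
import random

# Hypothetical sub-feature layout for one acoustic frame. The names and
# index ranges are illustrative assumptions, not taken from the paper.
ACOUSTIC_DOMAINS = {
    "pitch": (0, 2),
    "energy": (2, 5),
    "spectral": (5, 9),
}
HIDDEN = 4  # assumed common output width for every domain encoder


def split_by_domain(frame, domains):
    """Segregate a flat feature vector into named sub-feature groups."""
    return {name: frame[start:end] for name, (start, end) in domains.items()}


def make_projection(in_dim, out_dim, rng):
    """Random linear map (list of rows) standing in for a learned encoder."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, vector):
    """Apply a linear map to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def extract_fine_grained(frame, domains, encoders):
    """Encode each domain separately, then concatenate the results."""
    groups = split_by_domain(frame, domains)
    fused = []
    for name in domains:
        fused.extend(project(encoders[name], groups[name]))
    return fused


rng = random.Random(0)
encoders = {
    name: make_projection(end - start, HIDDEN, rng)
    for name, (start, end) in ACOUSTIC_DOMAINS.items()
}
frame = [0.1 * i for i in range(9)]  # toy 9-dimensional acoustic frame
fused = extract_fine_grained(frame, ACOUSTIC_DOMAINS, encoders)
print(len(fused))  # 3 domains x HIDDEN = 12
```

Projecting every domain to the same width before concatenation keeps a high-dimensional sub-feature group from dominating the fused representation, which is one plausible reading of "uniformly re-integrating"; the paper itself may do this differently.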