Abstract: Multimodal sentiment analysis (MSA), which aims to improve text-based sentiment analysis with the associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a phenomenon termed text-predominant. Under these circumstances, this work emphasizes making non-verbal cues matter for the MSA task. First, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS. Compared with the original dataset, CH-SIMS v2.0 doubles its size with another 2121 refined video segments with both unimodal and multimodal annotations, and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Second, from the model perspective, benefiting from the unimodal annotations and the unlabelled data in CH-SIMS v2.0, we propose the Acoustic Visual Mixup Consistent (AV-MC) framework. The designed modality mixup module can be regarded as an augmentation that mixes the acoustic and visual modalities from different videos. By drawing unobserved multimodal contexts alongside the text, the model learns to be aware of different non-verbal contexts for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research on discovering emotion-bearing acoustic and visual cues and pave the path to interpretable end-to-end HCI applications for real-world scenarios.
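
The modality mixup idea in the abstract, mixing acoustic and visual features from different videos while keeping the text fixed, can be illustrated with a minimal sketch. This is a generic mixup-style interpolation and not necessarily the formulation used by AV-MC. The batch layout (dicts with "text", "audio", "vision" keys), the function names `mix_features` and `av_mixup`, the Beta(alpha, alpha) sampling of the mixing weight, and the random pairing of samples are all assumptions made for illustration.

```python
import random


def mix_features(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def av_mixup(batch, alpha=1.0, rng=None):
    """Mix acoustic and visual features across samples; text is untouched.

    batch: list of dicts with keys "text", "audio", "vision"
           (a hypothetical layout, not taken from the paper).
    Returns the mixed batch, the mixing weight, and the partner indices.
    """
    rng = rng or random.Random()
    # Assumption: one mixing weight per batch, drawn as in standard mixup.
    lam = rng.betavariate(alpha, alpha)
    # Assumption: each sample is paired with a randomly chosen sample.
    partners = list(range(len(batch)))
    rng.shuffle(partners)
    mixed = []
    for sample, j in zip(batch, partners):
        other = batch[j]
        mixed.append({
            "text": sample["text"],  # text modality is kept as-is
            "audio": mix_features(sample["audio"], other["audio"], lam),
            "vision": mix_features(sample["vision"], other["vision"], lam),
        })
    return mixed, lam, partners
```

In this sketch each text is paired with non-verbal features that were never observed together with it, which is the sense in which mixup acts as an augmentation. How the mixed samples are supervised (for example, through a consistency objective) is left out because the abstract does not specify it.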

Sentiment Knowledge Enhanced Self-supervised Learning for Multimodal Sentiment Analysis

KEBR: Knowledge Enhanced Self-Supervised Balanced Representation for Multimodal Sentiment Analysis

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

AMSA: Adaptive Multimodal Learning for Sentiment Analysis

KESA: A Knowledge Enhanced Approach For Sentiment Analysis

Semantic-specific multimodal relation learning for sentiment analysis

SKEAFN: Sentiment Knowledge Enhanced Attention Fusion Network for multimodal sentiment analysis

Hierarchical Knowledge Stripping for Multimodal Sentiment Analysis

Target-oriented Sentiment Classification with Sequential Cross-modal Semantic Graph

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

M$^{3}$SA: Multimodal Sentiment Analysis Based on Multi-Scale Feature Extraction and Multi-Task Learning

Low-rank tensor fusion and self-supervised multi-task multimodal sentiment analysis

Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning

A text guided multi-task learning network for multimodal sentiment analysis

Word-wise Sparse Attention for Multimodal Sentiment Analysis

Towards Robust Multimodal Sentiment Analysis with Incomplete Data

Sentiment-aware Multimodal Pre-Training for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning

Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

Cooperative Sentiment Agents for Multimodal Sentiment Analysis