Abstract: Multimodal sentiment classification is an active research field that aims to extract sentiment information and classify sentiment polarity from sequential multimodal data. Most existing sentiment recognition algorithms explore multimodal fusion schemes and achieve good performance. However, two key challenges remain. First, inter- and intra-modality features must be extracted effectively prior to fusion while ambiguity is reduced at the same time. Second, the model must learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to address these issues. For the first challenge, our proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters out inconsistencies across modalities. We also use parallel structures to learn more comprehensive sentiment information from modality pairs (i.e., acoustic and visual). For the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture implicit temporal distribution-level similarities. These strategies help reduce domain shift between modalities and extract effective sequential modality-invariant representations. We have conducted experiments on several public datasets (i.e., YouTube and MOUD), and the results show that our proposed method outperforms state-of-the-art multimodal sentiment classification methods.
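The distribution-level similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each timestamp of a modality is summarized by a diagonal-covariance Gaussian (a mean vector and a variance vector), that the two sequences are already aligned to the same length, and that the per-timestamp KL values are averaged. The abstract does not specify the covariance structure, the direction of the divergence, or the aggregation, and the function names `kl_diag_gaussians` and `temporal_kl` are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for one timestamp.

    Assumption: diagonal covariance, so the KL is a sum over dimensions.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def temporal_kl(seq_p, seq_q):
    """Average the per-timestamp KL over two aligned sequences.

    Each sequence is a list of (mean, variance) pairs, one pair per timestamp.
    Averaging over time is an assumption made for this sketch.
    """
    if len(seq_p) != len(seq_q):
        raise ValueError("sequences must be aligned to the same length")
    kls = [
        kl_diag_gaussians(mp, vp, mq, vq)
        for (mp, vp), (mq, vq) in zip(seq_p, seq_q)
    ]
    return sum(kls) / len(kls)


# Toy data (made up for illustration): two timestamps, two feature dimensions.
acoustic = [([0.0, 0.0], [1.0, 1.0]), ([1.0, -1.0], [0.5, 2.0])]
visual = [([0.0, 0.0], [1.0, 1.0]), ([0.0, 0.0], [1.0, 1.0])]

same = temporal_kl(acoustic, acoustic)  # identical distributions
diff = temporal_kl(acoustic, visual)    # mismatch at the second timestamp
print(same, diff)
```

Because the KL divergence is zero only when the two Gaussians coincide, minimizing a term like this during training would pull the per-timestamp distributions of two modalities toward each other, which is one way to read the "modality-invariant" objective stated above.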

Multimodal sentiment analysis with two-phase multi-task learning

A text guided multi-task learning network for multimodal sentiment analysis

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

M$^{3}$SA: Multimodal Sentiment Analysis Based on Multi-Scale Feature Extraction and Multi-Task Learning

Low-rank tensor fusion and self-supervised multi-task multimodal sentiment analysis

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Multi-modal Sentiment and Emotion Joint Analysis with a Deep Attentive Multi-task Learning Model

Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

Towards Robust Multimodal Sentiment Analysis with Incomplete Data

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Multimodal Sentiment Recognition With Multi-Task Learning

Sentiment-aware Multimodal Pre-Training for Multimodal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis