Abstract: Multimodal sentiment classification is an active research field that aims to extract sentiment information from sequential multimodal data and classify its sentiment tendency. Most existing sentiment recognition algorithms explore multimodal fusion schemes and achieve good performance. However, two key challenges remain. First, inter- and intra-modality features must be extracted effectively before fusion while reducing ambiguity. Second, the model must learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to address these issues. For the first challenge, the proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters inconsistencies across modalities. We also use parallel structures to learn more comprehensive sentiment information in pairs (i.e., acoustic and visual). For the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture implicit temporal distribution-level similarities. These strategies help reduce domain shifts between modalities and extract effective sequential modality-invariant representations. We conducted experiments on several public datasets (i.e., YouTube and MOUD), and the results show that the proposed method outperforms state-of-the-art multimodal sentiment classification methods.
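
The distribution-level similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each timestamp of a modality is summarized by a diagonal-covariance Gaussian (a mean vector and a variance vector), and that the per-timestamp KL divergences are averaged over the sequence. The function names `diag_gaussian_kl` and `temporal_kl` and the array shapes are hypothetical, chosen only for illustration.

```python
import numpy as np


def diag_gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, reduced over the last (feature) axis.

    Assumption: covariances are diagonal, so each Gaussian is described by a
    mean vector and a vector of strictly positive variances.
    """
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    per_dim = np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    return 0.5 * np.sum(per_dim, axis=-1)


def temporal_kl(mu_a, var_a, mu_b, var_b):
    """Average the per-timestamp KL over a sequence.

    Inputs have shape (T, D): T timestamps, D feature dimensions.
    Averaging over time is an assumption of this sketch, not a detail
    taken from the paper.
    """
    return float(np.mean(diag_gaussian_kl(mu_a, var_a, mu_b, var_b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 5, 4  # hypothetical sequence length and feature size
    mu_acoustic = rng.normal(size=(T, D))
    mu_visual = rng.normal(size=(T, D))
    var_acoustic = np.exp(rng.normal(size=(T, D)))  # exp keeps variances positive
    var_visual = np.exp(rng.normal(size=(T, D)))
    print(temporal_kl(mu_acoustic, var_acoustic, mu_visual, var_visual))
```

Used as a training penalty, a term of this kind would push the per-timestamp distributions of two modalities toward each other. Note that the KL divergence is asymmetric, so the direction (or a symmetrized sum of both directions) is a design choice that this sketch leaves open.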

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification
