Abstract: Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions, and it benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based, modality transformation-based and modality similarity-based methods. However, most of these methods rely heavily on strong correlations between modalities and cannot fully uncover and utilize those correlations to enhance sentiment analysis. As a result, they usually perform poorly when identifying the sentiment of multimodal data with weak correlations. To address this issue, we propose a two-stage semi-supervised model termed Correlation-aware Multimodal Transformer (CorMulT), which consists of a pre-training stage and a prediction stage. In the pre-training stage, a modality correlation contrastive learning module is designed to efficiently learn modality correlation coefficients between different modalities. In the prediction stage, the learned correlation coefficients are fused with modality representations to make the sentiment prediction. In experiments on the popular multimodal dataset CMU-MOSEI, CorMulT clearly surpasses state-of-the-art multimodal sentiment analysis methods.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper aims to address a key issue in multimodal sentiment analysis: existing methods perform poorly when dealing with weakly correlated multimodal data. Specifically, most current multimodal sentiment analysis methods heavily rely on strong correlations between different modalities but fail to fully explore and utilize these correlations to enhance sentiment analysis performance. As a result, these methods often fail to accurately identify sentiments when faced with weakly correlated multimodal data.

### Background and Problem Description

Multimodal sentiment analysis is an active research field that combines multiple perceptual modalities (such as text, images, and audio) to analyze human emotions. It is widely used in predicting movie box office, stock market performance, and political election results. Existing multimodal sentiment analysis methods can be roughly divided into three categories: modality interaction methods, modality transformation methods, and modality similarity methods. However, these methods mostly assume strong correlations between modalities and fail to fully exploit and utilize these correlations, leading to poor performance when dealing with weakly correlated multimodal data.

### The Issue of Weak Correlation

The issue of weak correlation mainly manifests in the following aspects:

1. **Inconsistent Speech**: Non-primary sounds mask the main audio theme.
2. **Inconsistent Content**: The objects described in the text do not match the objects shown in the images.
3. **Inconsistent Alignment**: Different modalities cannot be fully synchronized.
4. **Inconsistent Clarity**: Excessive noise interferes with the main signal.

### Solution

To address the above issues, the paper proposes a two-stage semi-supervised model called the Correlation-aware Multimodal Transformer (CorMulT). This model includes a pre-training stage and a prediction stage:

- **Pre-training Stage**: A modality correlation contrastive learning module is designed to efficiently learn the correlation coefficients between different modalities.
- **Prediction Stage**: The learned correlation coefficients are fused with modality representations for sentiment prediction.

### Main Contributions

1. **Identifying the Problem and Proposing a Solution**: The paper identifies the issue of weak correlation in multimodal sentiment analysis and proposes the CorMulT model, which enhances sentiment analysis performance by accurately learning the correlations between modalities and integrating them with modality representations.
2. **Modality Correlation Evaluator**: A pre-training model based on contrastive learning is proposed to learn the correlations between modalities, transforming multimodal features into a shared correlation space and effectively quantifying the distances between different modalities.
3. **Experimental Validation**: Extensive experiments were conducted, and the results show that this method significantly outperforms existing multimodal sentiment analysis methods.

### Conclusion

By introducing the CorMulT model, the paper effectively addresses the issue of weak correlation in multimodal sentiment analysis, improving the accuracy and robustness of sentiment analysis.
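To make the two-stage idea more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each modality has already been encoded into a fixed-length vector in a shared space, uses an InfoNCE-style loss as a stand-in for the contrastive objective, and uses rescaled cosine similarity as a stand-in for the learned correlation coefficient. The function names `cosine`, `info_nce`, and `fuse`, the temperature value, and the concatenation-based fusion are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_batch, audio_batch, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed objective, not from the paper).

    The i-th text vector and the i-th audio vector form the positive pair;
    every other audio vector in the batch acts as a negative.
    """
    total = 0.0
    for i, t in enumerate(text_batch):
        logits = [cosine(t, a) / temperature for a in audio_batch]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(text_batch)


def fuse(text_vec, audio_vec):
    """Hypothetical fusion step.

    Maps cosine similarity from [-1, 1] to a coefficient in [0, 1] and uses
    it to scale the audio vector before concatenating it with the text vector,
    so a weakly correlated modality contributes less to the fused feature.
    """
    coeff = (cosine(text_vec, audio_vec) + 1.0) / 2.0
    fused = list(text_vec) + [coeff * a for a in audio_vec]
    return coeff, fused


# Toy data: two samples, two dimensions per modality.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[2.0, 0.0], [0.0, 3.0]]   # each audio vector matches its text
audio_shuffled = [[0.0, 3.0], [2.0, 0.0]]  # pairs deliberately mismatched

loss_aligned = info_nce(text, audio_aligned)
loss_shuffled = info_nce(text, audio_shuffled)

strong_coeff, strong_fused = fuse(text[0], audio_aligned[0])
weak_coeff, weak_fused = fuse(text[0], audio_shuffled[0])
```

In this toy setup the loss is lower for the aligned batch than for the shuffled one, and the mismatched pair receives a smaller coefficient, so its audio features are down-weighted in the fused vector. A real system would learn the encoders and the correlation estimator jointly rather than relying on raw cosine similarity.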

CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis

Multi‐level Deep Correlative Networks for Multi‐modal Sentiment Analysis

Multi-level Correlation Mining Framework with Self-Supervised Label Generation for Multimodal Sentiment Analysis

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

Weakly Correlated Multimodal Sentiment Analysis: New Dataset and Topic-oriented Model

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

Learning Speaker-Independent Multimodal Representation for Sentiment Analysis

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition