CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis

Yangmin Li, Ruiqi Zhu, Wengen Li
2024-08-29
Abstract: Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions, and it benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based methods, modality transformation-based methods and modality similarity-based methods. However, most of these methods rely heavily on strong correlations between modalities, and cannot fully uncover and utilize the correlations between modalities to enhance sentiment analysis. Therefore, these methods usually perform poorly when identifying the sentiment of multimodal data with weak correlations. To address this issue, we propose a two-stage semi-supervised model termed Correlation-aware Multimodal Transformer (CorMulT), which consists of a pre-training stage and a prediction stage. At the pre-training stage, a modality correlation contrastive learning module is designed to efficiently learn modality correlation coefficients between different modalities. At the prediction stage, the learned correlation coefficients are fused with modality representations to make the sentiment prediction. According to experiments on the popular multimodal dataset CMU-MOSEI, CorMulT clearly surpasses state-of-the-art multimodal sentiment analysis methods.
Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper aims to address a key issue in multimodal sentiment analysis: existing methods perform poorly when dealing with weakly correlated multimodal data. Specifically, most current multimodal sentiment analysis methods heavily rely on strong correlations between different modalities but fail to fully explore and utilize these correlations to enhance sentiment analysis performance. As a result, these methods often fail to accurately identify sentiments when faced with weakly correlated multimodal data.

### Background and Problem Description

Multimodal sentiment analysis is an active research field that combines multiple perceptual modalities (such as text, images, and audio) to analyze human emotions. It is widely used in predicting movie box office, stock market performance, and political election results. Existing multimodal sentiment analysis methods can be roughly divided into three categories: modality interaction methods, modality transformation methods, and modality similarity methods. However, these methods mostly assume strong correlations between modalities and fail to fully exploit and utilize these correlations, leading to poor performance when dealing with weakly correlated multimodal data.

### The Issue of Weak Correlation

The issue of weak correlation mainly manifests in the following aspects:

1. **Inconsistent Speech**: Non-primary sounds mask the main audio theme.
2. **Inconsistent Content**: The objects described in the text do not match the objects shown in the images.
3. **Inconsistent Alignment**: Different modalities cannot be fully synchronized.
4. **Inconsistent Clarity**: Excessive noise interferes with the main signal.

### Solution

To address the above issues, the paper proposes a two-stage semi-supervised model called the Correlation-aware Multimodal Transformer (CorMulT). This model includes a pre-training stage and a prediction stage:

- **Pre-training Stage**: A modality correlation contrastive learning module is designed to efficiently learn the correlation coefficients between different modalities.
- **Prediction Stage**: The learned correlation coefficients are fused with modality representations for sentiment prediction.

### Main Contributions

1. **Identifying the Problem and Proposing a Solution**: The paper identifies the issue of weak correlation in multimodal sentiment analysis and proposes the CorMulT model, which enhances sentiment analysis performance by accurately learning the correlations between modalities and integrating them with modality representations.
2. **Modality Correlation Evaluator**: A pre-training model based on contrastive learning is proposed to learn the correlations between modalities, transforming multimodal features into a shared correlation space and effectively quantifying the distances between different modalities.
3. **Experimental Validation**: Extensive experiments were conducted, and the results show that this method significantly outperforms existing multimodal sentiment analysis methods.

### Conclusion

By introducing the CorMulT model, the paper effectively addresses the issue of weak correlation in multimodal sentiment analysis, improving the accuracy and robustness of sentiment analysis.
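
### Illustrative Sketch of the Two-Stage Idea

The two stages described under the Solution heading can be made concrete with a small, dependency-free Python sketch. It is not the paper's implementation: the summary only states that correlations are learned contrastively in a shared space and then fused with modality representations. The cosine-based coefficient, the InfoNCE-style loss, the temperature value, the scale-and-concatenate fusion, and all function names (`cosine`, `info_nce`, `correlation_coefficient`, `fuse`) are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def info_nce(text_batch, audio_batch, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice, not taken from the paper).

    The i-th text and i-th audio embedding form the positive pair;
    every other pairing in the batch acts as a negative.
    """
    losses = []
    for i, text_vec in enumerate(text_batch):
        logits = [cosine(text_vec, audio_vec) / temperature for audio_vec in audio_batch]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def correlation_coefficient(u, v):
    """Hypothetical correlation coefficient: cosine similarity rescaled to [0, 1]."""
    return 0.5 * (cosine(u, v) + 1.0)


def fuse(text_vec, audio_vec):
    """Hypothetical fusion: down-weight audio by its correlation with text,
    then concatenate text, scaled audio, and the coefficient itself."""
    c = correlation_coefficient(text_vec, audio_vec)
    return list(text_vec) + [c * a for a in audio_vec] + [c]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
    audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("loss (aligned pairs): ", info_nce(text, audio_aligned))
    print("loss (shuffled pairs):", info_nce(text, audio_shuffled))
    print("fused features:", fuse(text[0], audio_aligned[0]))
```

In this sketch a weakly correlated modality contributes less to the fused vector because its features are scaled by a coefficient near zero, while the coefficient itself is kept as an extra feature so a downstream predictor can still see how much the modalities agreed. A real model would learn projection layers into the shared space instead of comparing raw embeddings directly.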