Abstract: Multimodal sentiment recognition has attracted increasing attention in recent years due to its potential to improve sentiment recognition accuracy by integrating information from multiple modalities. However, the heterogeneity issue caused by the differences in modalities poses a significant challenge for multimodal sentiment recognition. In this paper, we propose a novel framework, Cross-Modal Contrastive Learning (CMCL), which integrates multiple contrastive learning methods and multimodal data augmentation to address the heterogeneity issue. Specifically, we establish a cross-modal contrastive learning framework by leveraging diversity contrastive learning, consistency contrastive learning and sample-level contrastive learning. Through diversity contrastive learning, we constrain modality features to different feature spaces, capturing the complementary nature of modality-specific features. Additionally, through consistency contrastive learning, we map the representations of different modalities into a shared feature space, capturing the consistency of modality-specific features. We also introduce two data augmentation techniques, namely random noise and modal combination, to improve the model's robustness. The experimental results show that our approach achieves state-of-the-art performance on three benchmark datasets and outperforms the existing baseline models. Our work demonstrates the effectiveness of cross-modal contrastive learning and data augmentation in multimodal sentiment recognition, and provides valuable insights for future research in this area.
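
The abstract does not state the loss used for consistency contrastive learning. As a rough illustration only, the sketch below assumes an InfoNCE-style objective, a common choice for pulling paired embeddings together in a shared space. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy text/audio embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    anchors[i] and positives[i] are embeddings of the same sample in two
    modalities; every positives[j] with j != i acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, a_i in enumerate(a):
        logits = [dot(a_i, p_j) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): two samples, each with a text and an audio vector.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # audio agrees with its own text
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]   # audio agrees with the other text

loss_aligned = info_nce(text, audio_aligned)
loss_swapped = info_nce(text, audio_swapped)
```

Under this assumed objective, the loss is lower when each sample's audio embedding lies close to its own text embedding than when the pairing is swapped, which is the behavior a consistency objective is meant to reward.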

What problem does this paper attempt to address?

The problem this paper addresses is modality heterogeneity in multimodal sentiment recognition. Because modalities such as text, audio, and video differ from one another, fusing their information is difficult. The specific problems are:

1. **Complementarity and Consistency of Modal Features**: Different modalities may carry complementary information, but existing methods often overlook this and lose valuable information.
2. **Redundant Features in the Modal Fusion Process**: Multimodal feature fusion can generate redundant features, which reduces recognition accuracy.
3. **Heterogeneity between Modalities**: Data from different modalities occupy different semantic spaces, which makes it difficult to fuse them directly.

To solve these problems, the authors propose a new framework, Cross-Modal Contrastive Learning (CMCL), which improves multimodal sentiment recognition by integrating multiple contrastive learning methods with multimodal data augmentation. The specific methods are:

- **Diversity Contrastive Learning (DCL)**: Constrains modality features to different feature spaces, capturing the complementary nature of modality-specific features.
- **Consistency Contrastive Learning (CCL)**: Maps the representations of different modalities into a shared feature space, capturing the consistency of modality-specific features.
- **Sample-level Contrastive Learning (SCL)**: Contrasts individual samples to make the model more robust to individual differences in emotional expression.
- **Multimodal Data Augmentation**: Introduces two augmentation techniques, random noise and modality combination, to reduce overfitting and improve model performance.

Experimental results show that this method achieves state-of-the-art performance on three benchmark datasets and outperforms existing baseline models. These results verify the effectiveness and potential of cross-modal contrastive learning and data augmentation in multimodal sentiment recognition.
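
The summary names the two augmentation techniques but not how they work. The sketch below is one possible reading, under stated assumptions: "random noise" is taken to mean adding Gaussian noise to a feature vector, and "modality combination" is taken to mean building a new sample from modality features of different samples that share a label. Both interpretations, the function names, the dictionary layout, and the toy data are hypothetical.

```python
import random


def add_random_noise(features, sigma=0.05, rng=None):
    """Return a copy of a feature vector with Gaussian noise added (assumed form)."""
    rng = rng if rng is not None else random.Random()
    return [x + rng.gauss(0.0, sigma) for x in features]


def combine_modalities(samples, rng=None):
    """Create one new sample per label by mixing modalities across samples.

    Assumption: each sample is a dict with 'text', 'audio', 'video' feature
    lists and a 'label'. Modalities are only mixed within the same label so
    that the new sample's label stays meaningful.
    """
    rng = rng if rng is not None else random.Random()
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample["label"], []).append(sample)

    augmented = []
    for label, group in by_label.items():
        if len(group) < 2:
            continue  # nothing to recombine with
        augmented.append({
            "text": list(rng.choice(group)["text"]),
            "audio": list(rng.choice(group)["audio"]),
            "video": list(rng.choice(group)["video"]),
            "label": label,
        })
    return augmented


# Toy data (made up): three samples, two of which share the label "positive".
samples = [
    {"text": [0.1, 0.2], "audio": [0.3], "video": [0.5], "label": "positive"},
    {"text": [0.2, 0.1], "audio": [0.4], "video": [0.6], "label": "positive"},
    {"text": [0.9, 0.8], "audio": [0.7], "video": [0.1], "label": "negative"},
]

rng = random.Random(0)  # seeded so repeated runs behave the same
noisy_text = add_random_noise(samples[0]["text"], sigma=0.05, rng=rng)
combined = combine_modalities(samples, rng=rng)
```

In this reading, `noisy_text` is a perturbed copy of the first sample's text features, and `combined` holds a single new "positive" sample, since the "negative" label has only one example to draw from.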

Cross-modal contrastive learning for multimodal sentiment recognition

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Multi-level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation

On the Generalization of Multi-modal Contrastive Learning

Toward Robust Multimodal Learning using Multimodal Foundational Models

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning