Cross-modal contrastive learning for multimodal sentiment recognition

Shanliang Yang, Lichao Cui, Lei Wang, Tao Wang
DOI: https://doi.org/10.1007/s10489-024-05355-8
IF: 5.3
2024-03-26
Applied Intelligence
Abstract: Multimodal sentiment recognition has obtained increasing attention in recent years due to its potential to improve sentiment recognition accuracy by integrating information from multiple modalities. However, the heterogeneity issue caused by the differences in modalities poses a significant challenge for multimodal sentiment recognition. In this paper, we propose a novel framework, Cross-Modal Contrastive Learning (CMCL), which integrates multiple contrastive learning methods and multimodal data augmentation to address the heterogeneity issue. Specifically, we establish a cross-modal contrastive learning framework by leveraging diversity contrastive learning, consistency contrastive learning and sample-level contrastive learning. Through diversity contrastive learning, we constrain modality features to different feature spaces, capturing the complementary nature of modality-specific features. Additionally, through consistency contrastive learning, we map the representations of different modalities into a shared feature space, capturing the consistency of modality-specific features. We also introduce two data augmentation techniques, namely random noise and modal combination, to improve the model's robustness. The experimental results show that our approach achieves state-of-the-art performance on three benchmark datasets and outperforms the existing baseline models. Our work demonstrates the effectiveness of cross-modal contrastive learning and data augmentation in multimodal sentiment recognition, and provides valuable insights for future research in this area.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper addresses the heterogeneity problem in multimodal sentiment recognition. Because modalities such as text, audio, and video differ from one another, fusing their information is difficult. The specific problems are:

1. **Complementarity and Consistency of Modal Features**: Different modalities may carry complementary information, but existing methods often overlook this and lose valuable information as a result.
2. **Redundant Features in the Modal Fusion Process**: Multimodal feature fusion can produce redundant features, which reduces the accuracy of sentiment recognition.
3. **Heterogeneity between Modalities**: Data from different modalities occupy different semantic spaces, which makes direct fusion difficult.

To solve these problems, the authors propose a new framework, Cross-Modal Contrastive Learning (CMCL), which improves the performance of multimodal sentiment recognition by integrating multiple contrastive learning methods with multimodal data augmentation. The specific methods are:

- **Diversity Contrastive Learning (DCL)**: Constrains the features of different modalities to different feature spaces in order to capture the complementary nature of modality-specific features.
- **Consistency Contrastive Learning (CCL)**: Maps the representations of different modalities into a shared feature space in order to capture the consistency of modality-specific features.
- **Sample-level Contrastive Learning (SCL)**: Applies contrastive learning at the sample level to make the model more robust to individual differences in emotional expression.
- **Multimodal Data Augmentation**: Introduces data augmentation techniques such as random noise and modality combination to reduce overfitting and improve model performance.
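To make the consistency idea and the random-noise augmentation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not give CMCL's loss formulas, so the widely used InfoNCE objective is swapped in as an assumption, and the function names (`info_nce`, `add_random_noise`), the temperature, and the noise scale are all hypothetical. Row `i` of one modality is treated as the positive for row `i` of the other, and every other row in the batch acts as a negative.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss between two batches of modality features.

    anchors[i] and positives[i] are assumed to come from the same sample
    (e.g. text and audio features of one utterance); all other rows of
    `positives` serve as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [dot(anchor, other) / temperature for other in p]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability of picking the true partner.
        total += log_denominator - logits[i]
    return total / len(a)


def add_random_noise(vec, std=0.05, rng=random):
    """Random-noise augmentation (assumed Gaussian): perturb each feature."""
    return [x + rng.gauss(0.0, std) for x in vec]
```

Minimizing this loss pulls paired features from two modalities together in a shared space, which is the behaviour the consistency objective is described as targeting. A diversity-style objective would instead need to push modality-specific features apart; how CMCL formulates that term is not stated in this summary, so it is left out of the sketch.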
Experimental results show that this method achieves state-of-the-art performance on three benchmark datasets and outperforms existing baseline models. These results support the effectiveness and potential of cross-modal contrastive learning and data augmentation in multimodal sentiment recognition.