Abstract: Multimodal sentiment analysis faces two challenges: modality representation and modality fusion. Most existing models rely only on the feature extraction network to learn modality representations, and the fusion mechanisms adopted by some models do not perform well. These limitations hinder the model from learning rich emotional information and, in turn, weaken its predictive ability. To solve these problems, we propose a multimodal sentiment analysis model based on two-stage contrastive learning and a feature hierarchical fusion network (TSCL-FHFN). First, we apply contrastive learning to both the unimodal feature representations and the multimodal fusion feature representation. Through a two-stage contrastive learning task, TSCL-FHFN learns similar features for data with the same emotion category and distinguishable features for data with different emotion categories, which enables the model to better capture the features that separate emotions. Second, to further explore the deep semantic associations in multimodal data, we propose a multimodal feature hierarchical fusion network (FHFN). Its core idea is an attention-based directional cross-modal transformer that lets one modality receive information from another, thereby obtaining complementary information between the two modalities. FHFN then uses low-rank tensor fusion to further learn interactive information among multiple modalities. Finally, we conduct a series of comparative experiments on the CMU-MOSI and CMU-MOSEI datasets. Compared with current representative models, TSCL-FHFN achieves better experimental results. In addition, ablation experiments further verify the effectiveness of the improved TSCL-FHFN model.
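
The contrastive objective described in the abstract (similar features for the same emotion category, distinguishable features for different categories) can be illustrated with a label-aware contrastive loss. The abstract does not give the exact loss used by TSCL-FHFN, so the sketch below is only an assumption: it uses a generic supervised contrastive form, and the function name `supervised_contrastive_loss`, the `temperature` parameter, and the toy features are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed form, not the paper's).

    For each anchor, samples sharing its label are positives and all other
    samples act as negatives. Anchors without any positive are skipped.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    # Toy 2-D features: two samples near the x-axis, two near the y-axis.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(feats, [0, 0, 1, 1]))  # labels match the clusters
    print(supervised_contrastive_loss(feats, [0, 1, 0, 1]))  # labels cut across the clusters
```

In a two-stage setup like the one the abstract outlines, a loss of this kind could be applied once to each unimodal representation and once to the fused representation. How the paper weights or schedules the two stages is not stated here.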

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Low-rank tensor fusion and self-supervised multi-task multimodal sentiment analysis

TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

Multi-level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Cross-modal contrastive learning for multimodal sentiment recognition

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Multi-level Correlation Mining Framework with Self-Supervised Label Generation for Multimodal Sentiment Analysis

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

A text guided multi-task learning network for multimodal sentiment analysis

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

M$^{3}$SA: Multimodal Sentiment Analysis Based on Multi-Scale Feature Extraction and Multi-Task Learning

Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis