Abstract: Multimodal sentiment analysis faces two challenges: modality representation and modality fusion. Most existing models rely solely on the feature extraction network to learn modality representations, and the fusion mechanisms adopted by some models perform poorly. Both factors hinder the model from learning rich emotional information and in turn weaken its predictive ability. To address these problems, we propose a multimodal sentiment analysis model based on two-stage contrastive learning and a feature hierarchical fusion network (TSCL-FHFN). First, we apply contrastive learning to both the unimodal feature representations and the multimodal fusion feature representation. Through a two-stage contrastive learning task, TSCL-FHFN learns similar features for data with the same emotion category and distinguishable features for data with different emotion categories, which enables the model to better capture the features that differentiate emotions. Second, to further explore the deep semantic associations in multimodal data, we propose a multimodal feature hierarchical fusion network (FHFN). Its core idea is an attention-based directional cross-modal transformer that lets one modality receive information from another, thereby obtaining complementary information between the two modalities. FHFN then uses low-rank tensor fusion to further learn the interactions among multiple modalities. Finally, we conduct a series of comparative experiments on the CMU-MOSI and CMU-MOSEI datasets. Compared with current representative models, TSCL-FHFN achieves better experimental results. In addition, ablation experiments further verify the effectiveness of the proposed TSCL-FHFN model.
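The contrastive objective summarized in the abstract (similar features for samples with the same emotion category, distinguishable features otherwise) can be illustrated with a small sketch. The abstract does not give the exact loss, so the code below assumes a generic supervised contrastive (SupCon-style) formulation; the function names, the temperature value, and the toy features are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length; a zero vector is left unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def supervised_contrastive_loss(features, labels, temperature=0.5):
    """SupCon-style loss (an assumed stand-in, not the paper's exact loss).

    For each anchor, samples sharing its label are positives and all other
    samples act as negatives. The loss is low when positives are closer to
    the anchor than negatives in cosine similarity.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner is skipped
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a positive for anchor i.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy 2-D features: the first two point one way, the last two the other way.
feats = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
aligned = supervised_contrastive_loss(feats, [1, 1, 0, 0])     # labels match geometry
misaligned = supervised_contrastive_loss(feats, [1, 0, 1, 0])  # labels cut across it
```

In this toy setup the loss is smaller when the labels agree with the geometry of the features than when they cut across it, which is the behavior a contrastive stage is meant to encourage. Under the two-stage design described above, a loss of this kind would presumably be applied once to unimodal features and once to the fused multimodal feature.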

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities.

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

NHFNET: A Non-Homogeneous Fusion Network for Multimodal Sentiment Analysis

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

TeFNA: Text-centered Fusion Network with crossmodal Attention for multimodal sentiment analysis

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Multi-level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction

Hierarchical graph contrastive learning framework based on quantum neural networks for sentiment analysis

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling