Abstract: Multimodal sentiment analysis (MSA) captures and analyzes sentiment by fusing information from multiple modalities, which improves the understanding of real-world environments. The key challenges are handling noise in the acquired data and achieving effective multimodal fusion. To handle noise, existing methods combine multimodal features to mitigate errors in sentiment word recognition caused by the performance limitations of automatic speech recognition (ASR) models. However, how to use and combine different modalities more efficiently to address data noise remains an open problem. As for fusion, most existing methods adapt poorly to the feature differences between modalities, which makes it difficult to capture the complex nonlinear interactions that may exist between them. To overcome these issues, this paper proposes a new framework named multimodal-word-refinement and cross-modal-hierarchy (MWRCMH) fusion. Specifically, we use a multimodal word correction module to reduce sentiment word recognition errors caused by ASR. For multimodal fusion, we design a cross-modal hierarchical fusion module that employs cross-modal attention mechanisms to fuse features between pairs of modalities, yielding fused bimodal features. The bimodal features and the unimodal features are then fused through a nonlinear layer to obtain the final multimodal sentiment features. Experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that the proposed approach outperforms multiple baselines, achieving Has0-F1 scores of 76.43%, 80.15%, and 81.93%, respectively.
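The two-stage fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_modal_attention`, `hierarchical_fusion`), the choice of modality pairs, the mean pooling over time, the `tanh` nonlinearity, and the random weights are all assumptions made for illustration, and the learned query/key/value projections of a real attention layer are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, context_seq):
    """Scaled dot-product attention in which one modality (query_seq, shape
    (Tq, d)) attends to another (context_seq, shape (Tc, d)).
    Assumption: no learned projections, to keep the sketch short."""
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)  # (Tq, Tc)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ context_seq                     # (Tq, d)

def hierarchical_fusion(text, audio, video, w, b):
    """Stage 1: fuse pairs of modalities with cross-modal attention.
    Stage 2: combine bimodal and unimodal features through one nonlinear layer."""
    text_audio = cross_modal_attention(text, audio)
    text_video = cross_modal_attention(text, video)
    audio_video = cross_modal_attention(audio, video)
    # Assumption: mean pooling over time turns each sequence into one vector.
    parts = (text_audio, text_video, audio_video, text, audio, video)
    joint = np.concatenate([p.mean(axis=0) for p in parts])  # (6 * d,)
    return np.tanh(w @ joint + b)                             # (out_dim,)

# Hypothetical toy inputs: sequences of different lengths, shared feature size d.
rng = np.random.default_rng(0)
d, out_dim = 8, 4
text = rng.normal(size=(5, d))
audio = rng.normal(size=(7, d))
video = rng.normal(size=(6, d))
w = rng.normal(size=(out_dim, 6 * d)) * 0.1
b = np.zeros(out_dim)

fused = hierarchical_fusion(text, audio, video, w, b)
print(fused.shape)
```

In this sketch the fused vector would feed a sentiment prediction head; in a trained model the weights `w` and `b`, along with the attention projections left out here, would be learned rather than sampled.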

Prompt Link Multimodal Fusion in Multimodal Sentiment Analysis

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts

Tri-Modalities Fusion for Multimodal Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Conditional Prompt Tuning for Multimodal Fusion

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

CSMF-SPC: Multimodal Sentiment Analysis Model with Effective Context Semantic Modality Fusion and Sentiment Polarity Correction

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Multi-level Correlation Mining Framework with Self-Supervised Label Generation for Multimodal Sentiment Analysis

TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

Multimodal Multi-loss Fusion Network for Sentiment Analysis