Abstract: In real-world settings, multimodal sentiment analysis (MSA) captures and analyzes sentiment by fusing information from multiple modalities, which improves the understanding of real-world environments. The key challenges are handling noise in the acquired data and achieving effective multimodal fusion. To handle noisy data, existing methods combine multimodal features to mitigate sentiment word recognition errors caused by the limited performance of automatic speech recognition (ASR) models. However, how to use and combine the different modalities more efficiently to address data noise remains an open problem. For multimodal fusion, most existing methods adapt poorly to feature differences between modalities, which makes it difficult to capture the potentially complex nonlinear interactions between them. To overcome these issues, this paper proposes a new framework named multimodal-word-refinement and cross-modal-hierarchy (MWRCMH) fusion. Specifically, we use a multimodal word correction module to reduce sentiment word recognition errors caused by ASR. For multimodal fusion, we design a cross-modal hierarchical fusion module that employs cross-modal attention mechanisms to fuse features between pairs of modalities, yielding fused bimodal features. The bimodal features and the unimodal features are then fused through a nonlinear layer to obtain the final multimodal sentiment features. Experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that the proposed approach outperforms the compared baselines, achieving Has0-F1 scores of 76.43%, 80.15%, and 81.93%, respectively.
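The two-stage fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which modality pairs are attended, the attention direction, the pooling, or the form of the nonlinear layer. The sketch assumes plain scaled dot-product attention without learned projections, mean pooling over time, three pairs (text-audio, text-video, audio-video), and a single tanh layer with random weights. All function names (`cross_modal_attention`, `hierarchical_fusion`, and so on) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq, context_seq):
    """Let one modality (query) attend to another (context).

    Assumption: scaled dot-product attention with no learned projections.
    Both inputs are sequences of d-dimensional feature vectors.
    """
    d = len(query_seq[0])
    context_t = [list(col) for col in zip(*context_seq)]
    scores = matmul(query_seq, context_t)  # shape: T_query x T_context
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context_seq)  # shape: T_query x d


def mean_pool(seq):
    """Average a sequence of vectors over time into one vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def nonlinear_layer(x, weight, bias):
    """Single dense layer with tanh activation (assumed form)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def hierarchical_fusion(text, audio, video, weight, bias):
    # Stage 1: pairwise cross-modal attention yields bimodal features.
    bimodal = [
        mean_pool(cross_modal_attention(text, audio)),
        mean_pool(cross_modal_attention(text, video)),
        mean_pool(cross_modal_attention(audio, video)),
    ]
    # Unimodal summaries are kept alongside the bimodal ones.
    unimodal = [mean_pool(text), mean_pool(audio), mean_pool(video)]
    # Stage 2: concatenate everything and pass it through a nonlinear layer.
    joint = [v for vec in bimodal + unimodal for v in vec]
    return nonlinear_layer(joint, weight, bias)


# Toy inputs: random features with a shared dimension d (an assumption).
rng = random.Random(0)
d, out_dim = 4, 8


def rand_seq(length):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(length)]


text, audio, video = rand_seq(5), rand_seq(7), rand_seq(6)
weight = [[rng.uniform(-0.5, 0.5) for _ in range(6 * d)] for _ in range(out_dim)]
bias = [0.0] * out_dim

fused = hierarchical_fusion(text, audio, video, weight, bias)
print(len(fused))  # 8, the chosen out_dim
```

In this sketch the sequences may have different lengths because attention aligns them; only the feature dimension `d` must match across modalities, which in practice would require a per-modality projection that is omitted here.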

Reproducibility Companion Paper of "MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style"

MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

Learning Speaker-Independent Multimodal Representation for Sentiment Analysis

Robust-MSA: Understanding the Impact of Modality Noise on Multimodal Sentiment Analysis

Hybrid Multimodal Feature Extraction, Mining and Fusion for Sentiment Analysis

Ch-Sims: A Chinese Multimodal Sentiment Analysis Dataset With Fine-Grained Annotations Of Modality

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

FDR-MSA: Enhancing multimodal sentiment analysis through feature disentanglement and reconstruction

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

Evaluation of data inconsistency for multi-modal sentiment analysis

UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition

Integrative Sentiment Analysis: Leveraging Audio, Visual, and Textual Data

FMSA-SC: A Fine-grained Multimodal Sentiment Analysis Dataset based on Stock Comment Videos

Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network

Weakening the Dominant Role of Text: CMOSI Dataset and Multimodal Semantic Enhancement Network

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning

SentDep: Pioneering Fusion-Centric Multimodal Sentiment Analysis for Unprecedented Performance and Insights

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement