Abstract: Multimodal sentiment analysis (MSA) captures and analyzes sentiment by fusing information from multiple modalities, which improves the understanding of real-world environments. The key challenges are handling noise in the acquired data and achieving effective multimodal fusion. To handle noise, existing methods combine multimodal features to mitigate sentiment word recognition errors caused by the performance limitations of automatic speech recognition (ASR) models. However, how to utilize and combine the different modalities more efficiently to address this noise remains an open problem. For fusion, most existing methods adapt poorly to the feature differences between modalities, which makes it difficult to capture the complex nonlinear interactions that may exist between them. To overcome these issues, this paper proposes a new framework named multimodal-word-refinement and cross-modal-hierarchy (MWRCMH) fusion. Specifically, we used a multimodal word correction module to reduce sentiment word recognition errors caused by ASR. For multimodal fusion, we designed a cross-modal hierarchical fusion module that employed cross-modal attention mechanisms to fuse features between pairs of modalities, yielding fused bimodal features. The bimodal and unimodal features were then fused through a nonlinear layer to obtain the final multimodal sentiment features. Experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets showed that the proposed approach outperformed the compared baselines, achieving Has0-F1 scores of 76.43%, 80.15%, and 81.93%, respectively.
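The two-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention with identity projections (no learned query, key, or value weights), mean pooling over time, a single tanh layer as the nonlinear fusion step, and one shared feature dimension across modalities. The function names, the choice of modality pairs, and the random inputs are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(target, source):
    """Let every time step of `target` attend over all steps of `source`.

    Assumption: identity projections, so queries are the target features
    and keys/values are the source features.
    """
    dim = len(target[0])
    attended = []
    for query in target:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in source
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * step[j] for w, step in zip(weights, source)) for j in range(dim)]
        )
    return attended


def mean_pool(sequence):
    """Average a sequence of feature vectors over time."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


def hierarchical_fuse(text, audio, video, weight, bias):
    """Level 1: pairwise cross-modal attention. Level 2: nonlinear fusion."""
    bimodal = [
        mean_pool(cross_modal_attention(text, audio)),
        mean_pool(cross_modal_attention(text, video)),
        mean_pool(cross_modal_attention(audio, video)),
    ]
    unimodal = [mean_pool(text), mean_pool(audio), mean_pool(video)]
    # Concatenate the three bimodal and three unimodal vectors.
    joint = [value for vector in bimodal + unimodal for value in vector]
    # Hypothetical nonlinear layer: tanh(W @ joint + b).
    return [
        math.tanh(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weight, bias)
    ]


rng = random.Random(0)


def random_sequence(steps, dim):
    """Stand-in for encoder outputs; real features would come from a model."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(steps)]


dim, out_dim = 4, 3
text = random_sequence(5, dim)
audio = random_sequence(7, dim)
video = random_sequence(6, dim)
weight = [[rng.uniform(-0.5, 0.5) for _ in range(6 * dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim

fused = hierarchical_fuse(text, audio, video, weight, bias)
print(len(fused))
```

In this sketch the modalities may have different sequence lengths, because each attended sequence takes the length of its target and is pooled before concatenation. A trained model would replace the identity projections and the random weight matrix with learned parameters.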

Enhancing Multimodal Fusion with Only Unimodal Data

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

MFFnet: Multimodal Feature Fusion Network for Synthetic Aperture Radar and Optical Image Land Cover Classification

Learning SAR-Optical Cross Modal Features for Land Cover Classification

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

Progressive fusion learning: A multimodal joint segmentation framework for building extraction from optical and SAR images

Multimodal Remote Sensing Data Classification Based on Gaussian Mixture Variational Dynamic Fusion Network

Multi-Resolution Multi-Modal Sensor Fusion For Remote Sensing Data With Label Uncertainty

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Incomplete Multimodal Learning for Remote Sensing Data Fusion

Multimodal Fusion Method Based on Self-Attention Mechanism

Multimodal Frequeny Spectrum Fusion Schema for RGB-T Image Semantic Segmentation

An Effective Multimodal Representation and Fusion Method for Multimodal Intent Recognition

Learning transferable cross-modality representations for few-shot hyperspectral and LiDAR collaborative classification

Multimodal Hyperspectral Image Classification via Interconnected Fusion

A multimodal fusion framework for urban scene understanding and functional identification using geospatial data

Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

Multi-modal Object Detection of UAV Remote Sensing Based on Joint Representation Optimization and Specific Information Enhancement