Abstract: Multimodal sentiment analysis models determine users' sentiments by drawing on information from several sources (e.g., textual, visual, and audio). However, deploying such models in real-world environments raises two key challenges: (1) the textual modality depends on automatic speech recognition (ASR) models, whose errors in recognizing sentiment words can mislead sentiment analysis of the text, and (2) information density varies across modalities, which complicates the design of a high-quality fusion framework. To address these challenges, this paper proposes a framework, MSWOHHF, that combines a novel Multimodal Sentiment Word Optimization Module with a heterogeneous hierarchical fusion network. Specifically, the Multimodal Sentiment Word Optimization Module optimizes the sentiment words in the text produced by the ASR model, thereby reducing sentiment word recognition errors. In the multimodal fusion phase, the heterogeneous hierarchical fusion network first uses a Transformer Aggregation Module to fuse the visual and audio modalities, enhancing the high-level semantic features of each modality. A Cross-Attention Fusion Module then integrates the textual modality with the fused audiovisual representation. Next, a Feature-Based Attention Fusion Module performs fusion by dynamically tuning the weights of the combined and unimodal representations. Sentiment polarity is then predicted by a nonlinear neural network. Experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that MSWOHHF outperforms several baselines.
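
The abstract describes the Feature-Based Attention Fusion Module only as dynamically tuning the weights of the combined and unimodal representations. It gives no equations, so the following is a minimal sketch of one generic way to do this, softmax-weighted fusion, and should not be read as the authors' module. The function names, the scoring vector `w`, and the toy two-dimensional representations are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(reps, w):
    """Fuse several representation vectors with softmax attention weights.

    reps: list of equal-length vectors (e.g. text, audiovisual, combined).
    w:    hypothetical scoring vector; in a trained model it would be learned.
    """
    # One scalar relevance score per representation (dot product with w).
    scores = [sum(r_j * w_j for r_j, w_j in zip(rep, w)) for rep in reps]
    weights = softmax(scores)
    # Weighted sum of the representations, one dimension at a time.
    fused = [
        sum(weights[k] * reps[k][j] for k in range(len(reps)))
        for j in range(len(w))
    ]
    return fused, weights


# Toy inputs (assumed values, not taken from the paper).
text_rep = [1.0, 0.0]
audiovisual_rep = [0.0, 1.0]
combined_rep = [1.0, 1.0]

fused, weights = attention_fusion(
    [text_rep, audiovisual_rep, combined_rep], w=[0.5, -0.5]
)
```

With this `w`, the text representation scores highest and so receives the largest weight. With an all-zero `w`, every representation would be weighted equally and the result would reduce to a plain average.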

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Learning Robust Heterogeneous Signal Features from Parallel Neural Network for Audio Sentiment Analysis

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Text Sentiment Analysis Based on Convolutional Neural Network and Bidirectional LSTM Model

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

User reviews: Sentiment analysis using lexicon integrated two-channel CNN–LSTM family models

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Robust Audio-Visual Mandarin Speech Recognition Based on Adaptive Decision Fusion and Tone Features

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

Learning Affective Features with a Hybrid Deep Model for Audio–Visual Emotion Recognition

Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation

Complementary Fusion of Multi-Features and Multi-Modalities in Sentiment Analysis

Combining Vector Space Features and Convolution Neural Network for Text Sentiment Analysis

WeaveNet: End-to-End Audiovisual Sentiment Analysis