Abstract: Multimodal sentiment analysis is a rapidly growing field of research. A promising area of opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two at a time and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to a 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, for which current state-of-the-art techniques incorporate contextual information from other utterances of the same clip, our hierarchical fusion yields an improvement of up to 2.4% (almost a 10% reduction in error rate) over the currently used concatenation. The implementation of our method is publicly available as open-source code.
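
The relationship between the absolute accuracy gains and the relative error-rate reductions quoted above can be checked with a little arithmetic. The sketch below is only illustrative: the baseline accuracies (80% and 75%) are assumptions chosen to be consistent with the quoted figures, not values taken from the paper, and `relative_error_reduction` is a hypothetical helper name.

```python
def relative_error_reduction(baseline_acc_pct: float, improved_acc_pct: float) -> float:
    """Return the relative reduction in error rate, in percent.

    Both arguments are accuracies in percent; error rate is 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc_pct
    improved_err = 100.0 - improved_acc_pct
    if baseline_err <= 0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Assumed baseline of 80% accuracy: a 1-point gain removes 1 of 20 error points.
print(f"{relative_error_reduction(80.0, 81.0):.1f}%")  # 5.0%

# Assumed baseline of 75% accuracy: a 2.4-point gain removes 2.4 of 25 error points.
print(f"{relative_error_reduction(75.0, 77.4):.1f}%")  # 9.6%
```

Under these assumptions, a 1% gain corresponds to a 5% error reduction only when the baseline error is about 20%, and a 2.4% gain corresponds to "almost 10%" when the baseline error is a little above 24%.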

What problem does this paper attempt to address?

This paper addresses how to fuse information from different modalities (audio, video, text) more effectively in multimodal sentiment analysis. Existing methods usually concatenate the per-modality feature vectors (early fusion), which cannot filter or resolve conflicting or redundant information across modalities. The authors therefore propose a hierarchical fusion strategy that first fuses the modalities in pairs and then fuses all three together.

The main contributions of the paper are:

1. **Proposing a new hierarchical fusion strategy**: Unlike simple concatenation, the method fuses modality information in stages. It first fuses the modalities in pairs and then fuses the resulting bimodal vectors into a single trimodal feature vector. This is intended to better capture the relationships between modalities and thereby improve sentiment classification accuracy.
2. **Introducing a context-aware mechanism**: To further improve the fusion, the authors use a recurrent neural network (RNN), specifically the gated recurrent unit (GRU), to model contextual information for each modality and for the fused feature vectors. This lets the model use context from the other utterances of a multi-utterance video clip, reducing the error rate.
3. **Experimental verification**: Experiments on two datasets, CMU-MOSI and IEMOCAP, show that the proposed hierarchical fusion strategy outperforms feature concatenation on both individual-utterance and multi-utterance video-clip sentiment analysis. On the CMU-MOSI dataset, the method reduces the error rate by 5% for individual utterances and by nearly 10% for utterances within multi-utterance video clips.

In summary, the main goal of this paper is to improve the accuracy and robustness of multimodal sentiment analysis by improving the multimodal fusion mechanism, especially when dealing with complex emotional expressions.
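
The two-stage idea described above (pairwise fusion first, then fusion of the three bimodal results) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the use of a single tanh dense layer over concatenated pairs, the shared feature size `dim`, the random untrained weights, and the names `dense`, `init_params`, and `hierarchical_fusion` are all assumptions made for illustration. The GRU-based context modeling is omitted.

```python
import numpy as np


def dense(x, weights, bias):
    """One fully connected layer with tanh activation."""
    return np.tanh(x @ weights + bias)


def init_params(dim, rng):
    """Random (untrained) weights for three pairwise layers and one trimodal layer."""
    def layer(n_in, n_out):
        return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "av": layer(2 * dim, dim),   # audio + video
        "at": layer(2 * dim, dim),   # audio + text
        "vt": layer(2 * dim, dim),   # video + text
        "avt": layer(3 * dim, dim),  # the three bimodal vectors
    }


def hierarchical_fusion(audio, video, text, params):
    """Fuse modalities in pairs, then fuse the three bimodal vectors.

    Assumes each input has shape (num_utterances, dim), i.e. the modalities
    were already projected to a common feature size.
    """
    # Stage 1: bimodal fusion of every pair of modalities.
    av = dense(np.concatenate([audio, video], axis=-1), *params["av"])
    at = dense(np.concatenate([audio, text], axis=-1), *params["at"])
    vt = dense(np.concatenate([video, text], axis=-1), *params["vt"])
    # Stage 2: trimodal fusion of the bimodal results.
    return dense(np.concatenate([av, at, vt], axis=-1), *params["avt"])


rng = np.random.default_rng(0)
dim = 8
params = init_params(dim, rng)
audio, video, text = (rng.normal(size=(4, dim)) for _ in range(3))
fused = hierarchical_fusion(audio, video, text, params)
print(fused.shape)  # (4, 8)
```

In contrast, plain early fusion would be a single `np.concatenate([audio, video, text], axis=-1)` passed to a classifier; the staged version gives the model a place to weigh each pair of modalities before combining all three.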

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Sentiment analysis using Hierarchical Multimodal Fusion (HMF)

Multimodal Sentiment Analysis Based on Composite Hierarchical Fusion

Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

Multi‐level feature optimization and multimodal contextual fusion for sentiment analysis and emotion classification

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

Tri-Modalities Fusion for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Tensor Fusion Network for Multimodal Sentiment Analysis

What Makes the Difference? An Empirical Comparison of Fusion Strategies for Multimodal Language Analysis

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Dynamically Shifting Multimodal Representations Via Hybrid-Modal Attention for Multimodal Sentiment Analysis

Two-Level Multimodal Fusion for Sentiment Analysis in Public Security