Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, S. Poria
DOI: https://doi.org/10.48550/arXiv.1806.06228
2018-06-16
Abstract: Multimodal sentiment analysis is a very actively growing field of research. A promising area of opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two in two and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, for which current state-of-the-art techniques incorporate contextual information from other utterances of the same clip, our hierarchical fusion gives up to 2.4% (almost 10% error rate reduction) over currently used concatenation. The implementation of our method is publicly available in the form of open-source code.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to fuse information from different modalities (audio, video, text) more effectively in multimodal sentiment analysis. Existing methods usually combine the modalities by simple concatenation of their feature vectors (early fusion), which cannot filter out or resolve conflicting or redundant information across modalities. The authors therefore propose a hierarchical fusion strategy that first fuses the modalities in pairs and then fuses all three together. The main contributions of the paper are:

1. **A new hierarchical fusion strategy**: Unlike plain concatenation, the method fuses modality information in stages. It first fuses the modalities in pairs and then fuses the resulting bimodal vectors into a single trimodal feature vector. This is intended to capture the relationships between modalities better and thereby improve the accuracy of sentiment classification.
2. **A context-aware mechanism**: To further improve fusion, the authors use a recurrent neural network (RNN), specifically the gated recurrent unit (GRU), to model context for the feature vectors of each modality and for the fused feature vectors. For multi-utterance video clips, this lets the classification of one utterance draw on the other utterances of the same clip, which reduces the error rate.
3. **Experimental verification**: Experiments on two datasets, CMU-MOSI and IEMOCAP, show that the proposed hierarchical fusion outperforms conventional feature concatenation in sentiment analysis of both individual utterances and multi-utterance video clips.
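The two ideas described above, pairwise-then-joint fusion and GRU-based context across utterances, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each fusion step is a single dense layer with `tanh` applied to concatenated vectors, that all modalities share one feature dimension, that the GRU is unidirectional, and that context is modeled only once, after fusion. The weights are random stand-ins for trained parameters, and every name (`fuse`, `hierarchical_fuse`, `gru_context`, and so on) is hypothetical.

```python
import math
import random


def affine(weights, bias, x):
    """Return W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias, standing in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(layer, *vectors):
    """Assumed fusion step: concatenate the inputs, then one tanh dense layer."""
    joined = [v for vec in vectors for v in vec]
    return [math.tanh(a) for a in affine(*layer, joined)]


def hierarchical_fuse(text, audio, video, layers):
    """Two-stage fusion: pairwise (bimodal) first, then all three (trimodal)."""
    ta = fuse(layers["ta"], text, audio)
    tv = fuse(layers["tv"], text, video)
    av = fuse(layers["av"], audio, video)
    return fuse(layers["tav"], ta, tv, av)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_context(sequence, gates, dim):
    """Run a unidirectional GRU over utterance vectors; one state per utterance."""
    h, states = [0.0] * dim, []
    for x in sequence:
        z = [sigmoid(a) for a in affine(*gates["update"], x + h)]  # update gate
        r = [sigmoid(a) for a in affine(*gates["reset"], x + h)]   # reset gate
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in affine(*gates["cand"], x + gated_h)]
        # New state: h = (1 - z) * h_prev + z * candidate.
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        states.append(h)
    return states


# Toy clip (made-up data): 4 utterances with 3-dimensional features per modality.
rng = random.Random(0)
dim, n_utterances = 3, 4
layers = {name: init_layer(rng, dim, 2 * dim) for name in ("ta", "tv", "av")}
layers["tav"] = init_layer(rng, dim, 3 * dim)
gates = {name: init_layer(rng, dim, 2 * dim) for name in ("update", "reset", "cand")}

clip = [
    {m: [rng.uniform(-1, 1) for _ in range(dim)] for m in ("text", "audio", "video")}
    for _ in range(n_utterances)
]
fused = [hierarchical_fuse(u["text"], u["audio"], u["video"], layers) for u in clip]
contextual = gru_context(fused, gates, dim)
print(len(contextual), len(contextual[0]))  # one 3-dimensional state per utterance
```

Each entry of `contextual` depends on the fused vector of its own utterance and on the utterances before it, which is the sense in which context from the same clip informs the prediction. A classifier (not shown) would then map each state to a sentiment label.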
On the CMU-MOSI dataset, the method reduces the error rate by about 5% for individual utterances and by nearly 10% for utterances in multi-utterance video clips; the abstract reports these as accuracy gains of 1% and up to 2.4%, respectively, over feature concatenation. In summary, the paper aims to improve the accuracy and robustness of multimodal sentiment analysis by improving the multimodal fusion mechanism, especially when dealing with complex emotional expressions.
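The abstract pairs each accuracy gain with an error-rate reduction (1% with 5%, and up to 2.4% with almost 10%). The short sketch below shows how the two relate, assuming the accuracy gains are absolute percentage points and the error reductions are relative to the baseline error. The helper names are hypothetical, and the baseline figures it derives are inferences from the abstract, not numbers quoted from the paper.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error removed when accuracy rises to new_acc."""
    baseline_err = 1.0 - baseline_acc
    return (new_acc - baseline_acc) / baseline_err


def implied_baseline_error(abs_gain, rel_reduction):
    """Baseline error rate implied by an absolute gain and a relative reduction."""
    return abs_gain / rel_reduction


# Individual utterances: 1-point gain described as a 5% error reduction.
print(round(implied_baseline_error(0.01, 0.05), 3))
# Multi-utterance clips: up to 2.4-point gain, "almost 10%" error reduction.
print(round(implied_baseline_error(0.024, 0.10), 3))
# Sanity check with a hypothetical 80% baseline accuracy.
print(round(relative_error_reduction(0.80, 0.81), 3))
```

Under these assumptions, the concatenation baseline would have an error rate of roughly 20% on individual utterances and roughly 24% (or slightly more, since the abstract says "almost 10%") in the contextual setting.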