NHFNET: A Non-Homogeneous Fusion Network for Multimodal Sentiment Analysis

Ziwang Fu, Feng Liu, Qing Xu, Jiayin Qi, Xiangling Fu, Aimin Zhou, Zhibin Li
DOI: https://doi.org/10.1109/icme52920.2022.9859836
2022-01-01
Abstract: Fusion technology is crucial for multimodal sentiment analysis. Recent attention-based fusion methods demonstrate high performance and strong robustness. However, these approaches ignore the difference in information density among the three modalities: visual and audio carry low-level signal features, whereas text carries high-level semantic features. To this end, we propose a non-homogeneous fusion network (NHFNet) to achieve multimodal information interaction. Specifically, a fusion module with attention aggregation first fuses the visual and audio modalities, enhancing them into high-level semantic features. Cross-modal attention is then used to achieve information reinforcement between the text modality and the fused audio-visual representation. NHFNet thereby compensates for the differences in information density across modalities, enabling them to interact fairly. To verify the effectiveness of the proposed method, we conduct experiments on the CMU-MOSEI dataset under both aligned and unaligned settings. The experimental results show that the proposed method outperforms state-of-the-art methods. Code is available at https://github.com/skeletonNN/NHFNet.
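The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). It assumes plain scaled dot-product attention, random stand-in features, and a hypothetical set of aggregation queries that a real model would learn. The function names, the shared feature size of 16, the sequence lengths, and the choice to apply cross-modal attention in both directions are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention: each query row attends over the key/value rows."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value


def fuse_audio_visual(audio, visual, num_tokens, rng):
    """Stage 1 stand-in: aggregate low-level audio and visual frames into a few fused tokens.

    The aggregation queries are random here (hypothetical); a real model would learn them.
    """
    d = audio.shape[-1]
    frames = np.concatenate([audio, visual], axis=0)  # (T_audio + T_visual, d)
    queries = rng.standard_normal((num_tokens, d))
    return attention(queries, frames, frames)  # (num_tokens, d)


def cross_modal_reinforce(text, fused):
    """Stage 2 stand-in: text and fused audio-visual tokens attend to each other (residual)."""
    text_out = text + attention(text, fused, fused)
    fused_out = fused + attention(fused, text, text)
    return text_out, fused_out


rng = np.random.default_rng(0)
# Unaligned setting: the three sequences have different lengths.
text = rng.standard_normal((12, 16))
audio = rng.standard_normal((50, 16))
visual = rng.standard_normal((30, 16))

fused = fuse_audio_visual(audio, visual, num_tokens=4, rng=rng)
text_out, fused_out = cross_modal_reinforce(text, fused)
```

Because attention only requires the feature dimension to match, the sketch runs on sequences of different lengths without any alignment step. Compressing 80 audio and visual frames into 4 fused tokens is one simple way to mimic raising low-level signals to a denser representation before they interact with text.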