Multimodal Sentiment Analysis Network Based on Distributional Transformation and Gated Cross-Modal Fusion

Yuchen Zhang, Hong Zhong, Guilin Chen, Naji Alhusaini, Shenghui Zhao, Cheng Wu
DOI: https://doi.org/10.1109/nana63151.2024.00088
2024-01-01
Abstract: Multimodal sentiment analysis combines text, audio, and video modalities to extract sentiment information. Existing research focuses on representation learning and feature fusion. However, because data distributions differ across modalities, fusion models often struggle to capture inter-modal correlations; in particular, they tend to ignore the distribution differences between individual modalities, which reduces fusion effectiveness. In addition, text features carry more weight in multimodal sentiment analysis, which makes fusing verbal and non-verbal information more challenging. To address these issues, we propose the Cross-Modal Joint Representation Interaction Network (CMJN), which quantifies the distributional differences between modalities through a Distributed Transformation Layer (DTL) and learns joint representations of the verbal and non-verbal modalities using a Gated Cross-Modal Transformer (GCT) to capture inter-modal coherence and complementarity. Experimental results show that CMJN significantly improves multimodal sentiment analysis performance on the CMU-MOSI and CMU-MOSEI datasets.
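The abstract does not give the internals of the GCT, so the following is only a minimal sketch of the general idea of gated cross-modal attention, not the authors' implementation. It assumes that text features act as queries over a non-verbal sequence (audio or video) and that a sigmoid gate controls how much of the attended non-verbal signal is added back to the text representation. The function names, parameter names (`Wq`, `Wk`, `Wv`, `Wg`, `bg`), and dimensions are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical parameter set for one gated cross-modal attention step.
    scale = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(0.0, scale, (d, d)),
        "Wk": rng.normal(0.0, scale, (d, d)),
        "Wv": rng.normal(0.0, scale, (d, d)),
        "Wg": rng.normal(0.0, scale, (2 * d, d)),
        "bg": np.zeros(d),
    }


def gated_cross_modal_attention(text, other, params):
    """Fuse a non-verbal sequence into a text sequence (illustrative only).

    text:  (T_text, d) verbal features, used as queries
    other: (T_other, d) non-verbal features, used as keys and values
    Returns an array of shape (T_text, d).
    """
    q = text @ params["Wq"]
    k = other @ params["Wk"]
    v = other @ params["Wv"]
    d = q.shape[-1]

    # Scaled dot-product attention from text positions to non-verbal positions.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (T_text, T_other)
    attended = attn @ v                            # (T_text, d)

    # Assumed gating: a per-feature gate computed from text and attended features.
    gate_input = np.concatenate([text, attended], axis=-1)
    gate = sigmoid(gate_input @ params["Wg"] + params["bg"])

    # Residual connection keeps the text representation dominant.
    return text + gate * attended


rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))    # 5 text tokens, 8 features
audio = rng.normal(size=(7, 8))   # 7 audio frames, 8 features
params = init_params(8, rng)
fused = gated_cross_modal_attention(text, audio, params)
```

In this sketch a gate value near zero leaves the text representation unchanged, which is one simple way to let the verbal modality stay dominant while still admitting complementary non-verbal cues.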