GLoMo: Global-Local Modal Fusion for Multimodal Sentiment Analysis

Yan Zhuang, Yanru Zhang, Zheng Hu, Xiaoyue Zhang, Jiawen Deng, Fuji Ren
DOI: https://doi.org/10.1145/3664647.3681527
2024-01-01
Abstract: Multimodal Sentiment Analysis (MSA) has witnessed remarkable progress and gained increasing attention over the past decade. However, current MSA methodologies primarily rely on global representations extracted from different modalities, such as the mean of all token representations, to construct sophisticated fusion networks. These approaches often overlook the valuable details present in local representations, which consist of fused representations of several consecutive tokens. Additionally, integrating multiple local representations and fusing local with global information both present significant challenges. To address these limitations, we propose the Global-Local Modal (GLoMo) Fusion framework. It comprises two essential components: (i) modality-specific mixture of experts layers that integrate diverse local representations within each modality, and (ii) a global-guided fusion module that effectively combines global and local representations. The former component leverages specialized expert networks to automatically select and integrate crucial local representations from each modality, while the latter ensures the preservation of global information during the fusion process. We evaluate GLoMo on various datasets, encompassing tasks in multimodal sentiment analysis, multimodal humor detection, and multimodal emotion recognition. Extensive experiments demonstrate that GLoMo outperforms existing state-of-the-art models, validating the effectiveness of our proposed framework. Our code is publicly available at https://github.com/YetZzzzzz/GLoMo.
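To make the distinction between global and local representations concrete, the following is a minimal sketch and not the GLoMo implementation. It assumes a global representation is the mean of all token vectors (as the abstract states) and a local representation is the mean of a short window of consecutive tokens. The softmax gate scored against the global vector and the additive fusion are illustrative stand-ins for the paper's mixture of experts layers and global-guided fusion module; all function names are hypothetical.

```python
import math

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_representations(tokens, window):
    """Mean-pool every run of `window` consecutive tokens (stride 1)."""
    return [mean_pool(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_guided_mix(tokens, window=2):
    """Assumed toy fusion: weight local windows by similarity to the global mean."""
    g = mean_pool(tokens)                           # global representation
    locs = local_representations(tokens, window)    # local representations
    # Gate: dot product of each local vector with the global vector.
    scores = [sum(a * b for a, b in zip(g, loc)) for loc in locs]
    weights = softmax(scores)
    # Weighted sum of local vectors, one value per dimension.
    mixed = [sum(w * loc[d] for w, loc in zip(weights, locs))
             for d in range(len(g))]
    # Adding the global vector back keeps global information in the output.
    fused = [gi + mi for gi, mi in zip(g, mixed)]
    return g, weights, fused

# Four toy 2-dimensional token vectors for a single modality.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
g, weights, fused = global_guided_mix(tokens, window=2)
```

In this toy input the middle window receives the largest gate weight because it aligns most closely with the global mean; a learned gate, as in a real mixture of experts layer, would replace the fixed dot-product score.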