Multi-granularity Visual-Textual Jointly Modeling for Aspect-Level Multimodal Sentiment Analysis

Yuzhong Chen, Liyuan Shi, Jiali Lin, Jingtian Chen, Jiayuan Zhong, Chen Dong
DOI: https://doi.org/10.1007/s11227-024-06567-y
2025-01-01
Abstract: Aspect-level multimodal sentiment analysis aims to determine the sentiment polarity of a given aspect from a text review and its accompanying image. Despite substantial progress in existing research, the task still faces two challenges: (1) The feature granularity of the text and image modalities is inconsistent, which makes it difficult to capture the visual representations that correspond to aspect words. This inconsistency may introduce irrelevant or redundant information, causing noise and interference in sentiment analysis. (2) Traditional aspect-level sentiment analysis relies predominantly on fusing semantic and syntactic information to determine the sentiment polarity of a given aspect. Introducing the image modality, however, requires bridging the semantic gap that arises when sentiment features from different modalities are understood jointly. To address these challenges, a multi-granularity visual-textual feature fusion model (MG-VTFM) is proposed to enable deep sentiment interactions among semantic, syntactic, and image information. First, the model introduces a multi-granularity hierarchical graph attention network that uses a constituent tree to control the granularity of the semantic units that interact with images. This network extracts image sentiment information relevant to each specific granularity, reduces noise from images, and ensures sentiment relevance in single-granularity cross-modal interactions. Building on this, a multilayered graph attention module performs multi-granularity sentiment fusion, progressing from fine to coarse. Furthermore, a progressive multimodal attention fusion mechanism is introduced to maximize the extraction of abstract sentiment information from images. Lastly, a mapping mechanism is proposed to align cross-modal information based on aspect words, unifying the semantic spaces of the different modalities.
Our model demonstrates excellent overall performance on two datasets.
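The single-granularity cross-modal interaction described in the abstract can be illustrated with a minimal sketch. This is not the MG-VTFM implementation: it assumes plain scaled dot-product attention in which each textual semantic unit queries a set of image-region features, and the names `cross_modal_attention`, `text_units`, and `image_regions` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_units, image_regions):
    """Let each text unit attend over image regions.

    Assumption: text and image features already share dimension d.
    Returns (attended, weights), where attended[i] is the weighted
    average of image regions for text unit i.
    """
    d = len(text_units[0])
    attended, weights = [], []
    for query in text_units:
        # Scaled dot-product score between this text unit and every region.
        scores = [
            sum(q * r for q, r in zip(query, region)) / math.sqrt(d)
            for region in image_regions
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of region features: the visual context for this unit.
        attended.append(
            [sum(wi * region[k] for wi, region in zip(w, image_regions))
             for k in range(d)]
        )
    return attended, weights


# Toy features (hypothetical values): two text units, three image regions.
text_units = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended, weights = cross_modal_attention(text_units, image_regions)
```

In this toy setup each text unit places its largest weight on the image region most similar to it, so less relevant regions contribute less to the attended feature. The paper's model additionally groups text units by constituent-tree granularity and uses graph attention, which this sketch does not attempt to reproduce.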