Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

You Li,Han Ding,Yuming Lin,Xinyu Feng,Liang Chang
DOI: https://doi.org/10.1007/s10462-023-10685-z
IF: 9.588
2024-03-03
Artificial Intelligence Review
Abstract:Multimodal Aspect-Based Sentiment Analysis (MABSA) is an essential task in sentiment analysis that has garnered considerable attention in recent years. Typical approaches in MABSA often utilize cross-modal Transformers to capture interactions between textual and visual modalities. However, bridging the semantic gap between modalities spaces and addressing interference from irrelevant visual objects at different scales remains challenging. To tackle these limitations, we present the Multi-level Textual-Visual Alignment and Fusion Network (MTVAF) in this work, which incorporates three auxiliary tasks. Specifically, MTVAF first transforms multi-level image information into image descriptions, facial descriptions, and optical characters. These are then concatenated with the textual input to form a textual+visual input, facilitating comprehensive alignment between visual and textual modalities. Next, both inputs are fed into an integrated text model that incorporates relevant visual representations. Dynamic attention mechanisms are employed to generate visual prompts to control cross-modal fusion. Finally, we align the probability distributions of the textual input space and the textual+visual input space, effectively reducing noise introduced during the alignment process. Experimental results on two MABSA benchmark datasets demonstrate the effectiveness of the proposed MTVAF, showcasing its superior performance compared to state-of-the-art approaches. Our codes are available at https://github.com/MKMaS-GUET/MTVAF.
computer science, artificial intelligence
What problem does this paper attempt to address?