Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

Zihao Yin, Yongping Du, Yang Liu, Yuxin Wang
DOI: https://doi.org/10.1007/s11042-023-17685-9
IF: 2.577
2024-01-05
Multimedia Tools and Applications
Abstract: Sentiment analysis aims to detect the sentiment polarity of the massive volume of opinions and reviews emerging on the internet. With the growth of multimodal information on social media, such as text, image, audio and video, multimodal sentiment analysis has attracted increasing attention in recent years, and our work focuses on text and image data. Previous works usually ignore the semantic alignment between text and image and cannot capture the interaction between them, which impairs sentiment polarity prediction. To resolve these problems, we propose a novel multimodal sentiment analysis model, LXMERT-MMSA, based on a cross-modality attention mechanism. Each single-modality feature is encoded by a multi-layer Transformer encoder to capture the deep semantic information implied in the text and image. Moreover, the cross-modality attention mechanism enables the model to fuse the text and image features effectively and to obtain rich semantic information through their alignment, which improves the model's ability to capture the semantic relation between text and image. Accuracy and F1 score are used as evaluation metrics, and the experimental results on the MVSA-multiple dataset and the Twitter dataset show that our proposed model outperforms the previous SOTA model; the ablation results further show that the model makes good use of multimodal features.
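The cross-modality attention idea described in the abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with no learned projections, no multiple heads, and no layer stacking, so it is not the LXMERT-MMSA implementation. The function names and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where the queries come from one
    modality and the keys/values come from the other modality.
    Assumption: learned Q/K/V projections are omitted for brevity."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


# Toy features (illustrative only): 3 text tokens and 2 image regions.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0]]

# Each text token gathers information from the image regions, and vice versa.
text_to_image = cross_attention(text_feats, image_feats, image_feats)
image_to_text = cross_attention(image_feats, text_feats, text_feats)
```

Running attention in both directions gives each modality a view of the other, which is one common way to realize the kind of text-image alignment the abstract refers to.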
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering