Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) uses both the textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Current research has overlooked the impact on model performance of noise from irrelevant image regions. It has also made insufficient use of the text contained within images and of the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of three modules: a textual auxiliary module (textual aspect-sentiment extraction), a visual auxiliary module (visual aspect-sentiment prediction), and a main module (textual-visual alignment cross-modal MABSA). In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms, and we employ Optical Character Recognition (OCR) to capture the text contained within images. In the visual auxiliary module, we employ Adjective-Noun Pair (ANP) detection for supervised training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs, and a CNN-based Transformer in the visual auxiliary module to focus more on the important information in images. In the cross-modal MABSA module, we design a target-oriented interaction component to facilitate modal interaction learning and mitigate the impact of image noise, along with an alignment auxiliary component to optimize modal alignment training. We conducted extensive experiments on two publicly available benchmark datasets. The results demonstrate that TCMT significantly outperforms the baseline models and achieves state-of-the-art results. Both the textual auxiliary module and the visual auxiliary module effectively help the cross-modal MABSA module complete the task more efficiently.
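The abstract does not give the formulation of the target-oriented interaction component, so the following is only a minimal sketch of the general idea it builds on: generic scaled dot-product cross-attention in which text tokens act as queries over image regions, so that regions unrelated to the target receive low weight. The function names (`softmax`, `cross_modal_attention`) and the toy vectors are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_vecs, region_vecs):
    """Hypothetical helper: each text token attends over image regions.

    Text tokens are the queries; image regions are both keys and values.
    Returns (fused, weights), where fused[i] is the attention-weighted sum
    of region vectors for token i and weights[i] is its distribution over
    regions. This is plain cross-attention, not the TCMT component itself.
    """
    dim = len(region_vecs[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in text_vecs:
        # Scaled dot product between the token and every region.
        scores = [
            sum(q * k for q, k in zip(query, region)) / scale
            for region in region_vecs
        ]
        attn = softmax(scores)
        # Weighted sum of region vectors, one output dimension at a time.
        fused.append([
            sum(a * region[d] for a, region in zip(attn, region_vecs))
            for d in range(dim)
        ])
        weights.append(attn)
    return fused, weights


# Toy input (made up): one aspect-term token and three image regions,
# of which only the first points in the same direction as the token.
text_vecs = [[1.0, 0.0]]
region_vecs = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
fused, weights = cross_modal_attention(text_vecs, region_vecs)
```

In this toy case the first region gets the largest attention weight and the opposing third region the smallest, which is the sense in which attention conditioned on the target can down-weight irrelevant image regions.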

Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis

AMIFN: Aspect-guided Multi-view Interactions and Fusion Network for Multimodal Aspect-based Sentiment Analysis

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

An Interactive Attention Mechanism Fusion Network for Aspect-Based Multimodal Sentiment Analysis

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis

MSFNet: modality smoothing fusion network for multimodal aspect-based sentiment analysis

Aspects Are Anchors: Towards Multimodal Aspect-based Sentiment Analysis Via Aspect-driven Alignment and Refinement

Multi-Grained Fusion Network with Self-Distillation for Aspect-Based Multimodal Sentiment Analysis

Hierarchical Fusion Network with Enhanced Knowledge and Contrastive Learning for Multimodal Aspect-Based Sentiment Analysis on Social Media

Self-adaptive attention fusion for multimodal aspect-based sentiment analysis

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

MFSC: A Multimodal Aspect-Level Sentiment Classification Framework with Multi-Image Gate and Fusion Networks

Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment

MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism

Multi-level Attention Map Network for Multimodal Sentiment Analysis

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Multi-Model Fusion Framework Using Deep Learning for Visual-Textual Sentiment Classification