Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) uses both textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Current research has overlooked how noise from irrelevant image regions affects model performance, and it has made insufficient use of both the text contained within images and the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of two auxiliary modules and one main module: a textual auxiliary module for textual aspect-sentiment extraction, a visual auxiliary module for visual aspect-sentiment prediction, and a main cross-modal module for textual-visual alignment. In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms, and we employ Optical Character Recognition (OCR) to capture text contained within images. In the visual auxiliary module, we employ Adjective-Noun Pair (ANP) detection for supervised training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs, and a CNN-based Transformer in the visual auxiliary module to focus on the important information in images. In the cross-modal MABSA module, we design a target-oriented interaction component to facilitate modal interaction learning and mitigate the impact of image noise, along with an alignment auxiliary component to optimize modal alignment training. We conducted extensive experiments on two publicly available benchmark datasets. The results show that TCMT significantly outperforms the baseline models and achieves state-of-the-art results. Both the textual auxiliary module and the visual auxiliary module effectively assist the cross-modal MABSA module in completing the task more efficiently.
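
To make the idea of learning over a syntactic graph concrete, the following is a minimal sketch of a single generic graph-convolution step applied to a sentence's dependency graph. It is not the TCMT architecture: the abstract does not specify how the GCN is combined with the Transformer, so the function name `gcn_layer`, the toy adjacency matrix, the feature values, and the weight matrix are all illustrative assumptions. It shows only the standard propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), in which each token's representation is mixed with those of its syntactic neighbours.

```python
import math


def gcn_layer(adj, feats, weight):
    """One generic graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj    : n x n symmetric 0/1 adjacency matrix (hypothetical dependency edges)
    feats  : n x d token feature matrix H
    weight : d x o weight matrix W
    """
    n = len(adj)
    d = len(feats[0])
    o = len(weight[0])
    # Add self-loops so every token keeps part of its own representation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric degree normalisation.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[a_hat[i][j] * inv_sqrt_deg[i] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features over the graph.
    agg = [[sum(norm[i][k] * feats[k][c] for k in range(n)) for c in range(d)]
           for i in range(n)]
    # Linear projection followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weight[c][p] for c in range(d)))
             for p in range(o)]
            for i in range(n)]


# Toy example (assumed): tokens "the", "battery", "life", with "life" as the
# head of both other tokens, so the multi-word aspect "battery life" is linked.
adjacency = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
token_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
identity_w = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(adjacency, token_feats, identity_w)
```

After one step, the row for "battery" already contains part of the features of "life", which is the kind of signal that could help a tagger treat the two tokens as one aspect term.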

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis
