Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) uses both textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Existing research has overlooked how noise from irrelevant image regions degrades model performance, and it makes insufficient use of both the text contained within images and the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of two auxiliary modules and a main module: a textual auxiliary module for textual aspect-sentiment extraction, a visual auxiliary module for visual aspect-sentiment prediction, and a main cross-modal MABSA module that aligns the textual and visual modalities. In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms and apply Optical Character Recognition (OCR) to capture the text contained within images. In the visual auxiliary module, we use Adjective-Noun Pair (ANP) detection to supervise training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs and a CNN-based Transformer in the visual auxiliary module to focus on the important information in images. In the cross-modal MABSA module, we design a target-oriented interaction component that facilitates cross-modal interaction learning and mitigates the impact of image noise, along with an alignment auxiliary component that optimizes modal alignment training. Extensive experiments on two publicly available benchmark datasets show that TCMT significantly outperforms the baseline models and achieves state-of-the-art results. Both the textual and visual auxiliary modules effectively help the cross-modal MABSA module perform the task.
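
The abstract says only that the textual auxiliary module uses a GCN-based Transformer to learn syntactic graphs; it does not describe the layer itself. As a rough illustration, the following sketch implements one standard symmetric-normalized graph convolution, ReLU(D^{-1/2}(A+I)D^{-1/2} H W), over a dependency-parse adjacency matrix. The helper names (`dependency_adjacency`, `normalize`, `gcn_layer`), the example sentence, and its hand-written dependency edges are hypothetical, and how TCMT actually combines such a layer with Transformer attention is not assumed here.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def dependency_adjacency(n_tokens: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Undirected adjacency with self-loops (A + I) from (head, dependent) token indices."""
    adj = [[0.0] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        adj[i][i] = 1.0
    for head, dep in edges:
        adj[head][dep] = 1.0
        adj[dep][head] = 1.0
    return adj


def normalize(adj: Matrix) -> Matrix:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, D being the degree matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(h: Matrix, adj_norm: Matrix, w: Matrix) -> Matrix:
    """One graph convolution, ReLU(A_norm @ H @ W): each token mixes in its syntactic neighbours."""
    out = matmul(matmul(adj_norm, h), w)
    return [[max(0.0, x) for x in row] for row in out]


# Hypothetical example: "The battery life is great", with hand-written
# dependency edges life->The, life->battery, great->life, great->is.
tokens = ["The", "battery", "life", "is", "great"]
edges = [(2, 0), (2, 1), (4, 2), (4, 3)]
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

adj_norm = normalize(dependency_adjacency(len(tokens), edges))
# One-hot features and identity weights make the output equal to adj_norm,
# so each row shows how much every token contributes to that token.
out = gcn_layer(identity, adj_norm, identity)
print([round(x, 3) for x in out[1]])  # row for "battery": [0.0, 0.5, 0.354, 0.0, 0.0]
```

Because "battery" and "life" share a dependency edge, a single layer already mixes their representations, which is one plausible way syntactic structure can help mark the boundary of a multi-word aspect term such as "battery life". "great" reaches "battery" only after a second layer, since the two are two hops apart in the parse.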

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis