Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to use both textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Current research has overlooked the impact of noise from irrelevant image regions on model performance, and it has made insufficient use of both the text contained within images and the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of two auxiliary modules and a main module: a textual auxiliary module (textual aspect-sentiment extraction), a visual auxiliary module (visual aspect-sentiment prediction), and a main cross-modal MABSA module (textual-visual alignment). In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms, and we employ Optical Character Recognition (OCR) to capture the text contained within images. In the visual auxiliary module, we employ Adjective-Noun Pair (ANP) detection for supervised training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs, and a CNN-based Transformer in the visual auxiliary module to focus on the important information in images. In the cross-modal MABSA module, we design a target-oriented interaction component to facilitate modal interaction learning and mitigate the impact of image noise, along with an alignment auxiliary component to optimize modal alignment training. We conducted extensive experiments on two publicly available benchmark datasets. The results demonstrate that TCMT significantly outperforms the baseline model and achieves state-of-the-art results. Both the textual auxiliary module and the visual auxiliary module effectively assist the cross-modal MABSA module in completing the task more efficiently.
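
The abstract says a GCN-based component learns syntactic graphs but does not give its formulation. As a rough illustration only, the sketch below shows one generic graph-convolution step over a dependency graph, in which each token's representation becomes the average of itself and its syntactic neighbors. The row normalization, the ReLU activation, the function names, and the example parse are all assumptions made for illustration; none of them is taken from the TCMT paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def normalized_adjacency(edges: List[Tuple[int, int]], n: int) -> Matrix:
    """Undirected adjacency with self-loops, row-normalized (an assumed choice)."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop keeps each token's own features
    for head, dep in edges:
        adj[head][dep] = 1.0
        adj[dep][head] = 1.0
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] /= total
    return adj


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small illustrative inputs."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One generic GCN step: ReLU(A_hat @ H @ W)."""
    hidden = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical parse of "battery life is great" as (head, dependent) index pairs.
tokens = ["battery", "life", "is", "great"]
edges = [(1, 0), (3, 1), (3, 2)]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
adj = normalized_adjacency(edges, len(tokens))
out = gcn_layer(adj, identity, identity)  # one-hot features, identity weights
```

With one-hot features and identity weights, the output row for "battery" places equal weight on "battery" and "life", which is the kind of neighborhood mixing that could help group the two tokens into one multi-word aspect term.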

Layer-Level Progressive Transformer With Modality Difference Awareness for Multi-Modal Neural Machine Translation

Progressive modality-complement aggregative multitransformer for domain multi-modal neural machine translation

Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Multimodal Transformer For Multimodal Machine Translation

Enhancing Neural Machine Translation with Dual-Side Multimodal Awareness

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

Bilingual–Visual Consistency for Multimodal Neural Machine Translation

Multimodal Pretraining from Monolingual to Multilingual

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective

Multi-Domain Adaptation in Neural Machine Translation Through Multidimensional Tagging

Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers

Multimodal Transformer for Accelerated MR Imaging

DAS-CL: Towards Multimodal Machine Translation Via Dual-Level Asymmetric Contrastive Learning

EMMeTT: Efficient Multimodal Machine Translation Training

Multi-Hop Transformer for Document-Level Machine Translation