Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) uses both textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Current research has overlooked the impact that noise from irrelevant image regions has on model performance. It has also made insufficient use of the text contained within images and of the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of two auxiliary modules and a main module: a textual auxiliary module for aspect-sentiment extraction, a visual auxiliary module for aspect-sentiment prediction, and a main cross-modal module for textual-visual alignment. In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms, and we employ Optical Character Recognition (OCR) to capture the text contained within images. In the visual auxiliary module, we employ Adjective-Noun Pair (ANP) detection to supervise training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs, and a CNN-based Transformer in the visual auxiliary module to focus on the important information in images. In the cross-modal MABSA module, we design a target-oriented interaction component to facilitate modal interaction learning and mitigate the impact of image noise, along with an alignment auxiliary component to optimize modal alignment training. We conducted extensive experiments on two publicly available benchmark datasets. The results show that TCMT significantly outperforms the baseline models and achieves state-of-the-art results. Both the textual auxiliary module and the visual auxiliary module effectively assist the cross-modal MABSA module in completing the task more efficiently.
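
To illustrate the general idea behind target-oriented interaction, the sketch below uses an aspect-term vector as the query in ordinary scaled dot-product attention over image-region vectors, so that regions unrelated to the target receive small weights. It is a minimal sketch and not the TCMT implementation: the function name `target_to_region_attention`, the toy feature vectors, and the use of plain single-head attention are all assumptions made for illustration, since the abstract does not specify the component's internals.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def target_to_region_attention(target_vec, region_vecs):
    """Hypothetical helper: single-query scaled dot-product attention.

    target_vec  : feature vector of an aspect term (the query)
    region_vecs : feature vectors of image regions (keys and values)
    Returns the attention weights and the weighted sum of region vectors.
    """
    scale = math.sqrt(len(target_vec))
    weights = softmax([dot(target_vec, r) / scale for r in region_vecs])
    dim = len(region_vecs[0])
    attended = [
        sum(w * r[i] for w, r in zip(weights, region_vecs))
        for i in range(dim)
    ]
    return weights, attended


# Toy, made-up features (assumption: 4-dimensional embeddings).
target = [1.0, 0.0, 1.0, 0.0]
regions = [
    [2.0, 0.0, 2.0, 0.0],    # region aligned with the target
    [0.0, 2.0, 0.0, 2.0],    # region orthogonal to the target
    [-1.0, 0.0, -1.0, 0.0],  # region opposed to the target
]

weights, attended = target_to_region_attention(target, regions)
print(weights)   # the aligned region receives the largest weight
print(attended)  # target-conditioned summary of the image regions
```

In this toy setting the aligned region dominates the weighted sum, which is the sense in which conditioning attention on the target can reduce the influence of irrelevant regions.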

Cascaded Cross-Modal Transformer for Audio-Textual Classification

Cascaded Cross-Modal Transformer for Request and Complaint Detection

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

Multiresolution and Multimodal Speech Recognition with Transformers

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis

CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment

Multi-Modal Transformers Utterance-Level Code-Switching Detection

Audio-Visual Efficient Conformer for Robust Speech Recognition

Pre-training for Speech Translation: CTC Meets Optimal Transport

Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

Cross-modal Alignment with Optimal Transport for CTC-based ASR

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Dawn of the transformer era in speech emotion recognition: closing the valence gap

CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition

CM-BERT