Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) uses both textual and visual modalities to perform Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on tweets. Existing research has overlooked the impact of noise from irrelevant image regions on model performance, and has made insufficient use of both the text contained within images and the syntactic features of sentences. In this paper, we propose a Target-oriented Cross Modal Transformer (TCMT) for MABSA. The model consists of a textual auxiliary module, a visual auxiliary module, and a main module, which serve respectively as the textual aspect-sentiment extraction module, the visual aspect-sentiment prediction module, and the textual-visual alignment cross-modal module. In the textual auxiliary module, we use syntactic features to help the model identify the boundaries of multi-word aspect terms, and we employ Optical Character Recognition (OCR) to capture the text contained within images. In the visual auxiliary module, we use Adjective-Noun Pair (ANP) detection to supervise training on images. We also improve the cross-modal Transformer structure by designing a GCN-based Transformer in the textual auxiliary module to learn syntactic graphs, and a CNN-based Transformer in the visual auxiliary module to focus on the important information in images. In the main cross-modal MABSA module, we design a target-oriented interaction component to facilitate modal interaction learning and mitigate the impact of image noise, along with an alignment auxiliary component to optimize modal alignment training. We conduct extensive experiments on two publicly available benchmark datasets. The results demonstrate that TCMT significantly outperforms the baseline models and achieves state-of-the-art results. Both the textual auxiliary module and the visual auxiliary module effectively assist the cross-modal MABSA module in completing the task more efficiently.
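
The idea that text-side queries can down-weight irrelevant image regions can be illustrated with plain scaled dot-product cross-attention. The sketch below is only an illustration under stated assumptions, not the TCMT target-oriented interaction component itself: the function name `cross_modal_attention`, the toy two-dimensional features, and the omission of learned projection matrices are all hypothetical simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_modal_attention(text_queries, region_feats):
    """Hypothetical helper: each text token attends over image regions.

    Uses scaled dot-product attention with the region features acting as
    both keys and values. Learned projections are omitted for brevity.
    """
    dim = len(region_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for q in text_queries:
        # Attention distribution of this token over all image regions.
        w = softmax([dot(q, r) / scale for r in region_feats])
        # Weighted sum of region features: the visual context for the token.
        ctx = [sum(w_i * r[d] for w_i, r in zip(w, region_feats))
               for d in range(dim)]
        attended.append(ctx)
        weights.append(w)
    return attended, weights


# Toy data (assumed): two text tokens and three image regions.
# The third region points away from both tokens and plays the role of noise.
text_queries = [[4.0, 0.0], [0.0, 4.0]]
region_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

attended, weights = cross_modal_attention(text_queries, region_feats)
for token_weights in weights:
    print([round(w, 3) for w in token_weights])
```

In this toy setup each token places most of its attention on the region aligned with it and the least on the unrelated region, which is the general mechanism by which attention conditioned on the text side can suppress noisy regions.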

MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model with Multi-Modal Transformer

A Multimodal Saliency Model For Videos With High Audio-Visual Correspondence

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Transformer-based Multi-scale Feature Integration Network for Video Saliency Prediction

Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model

MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer

Audio-visual saliency prediction with multisensory perception and integration

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

VST++: Efficient and Stronger Visual Saliency Transformer

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model with Implicit Neural Representations

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

TCMT: Target-oriented Cross Modal Transformer for Multimodal Aspect-Based Sentiment Analysis

UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo

Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network