Abstract: Multimodal aspect-based sentiment analysis (MABSA) aims to determine the sentiment polarity of each aspect mentioned in the text based on multimodal content. Various approaches have been proposed to model multimodal sentiment features for each aspect via modal interactions. However, most existing approaches have two shortcomings: (1) the representation gap between the textual and visual modalities may increase the risk of misalignment in modal interactions; (2) in examples where the image is unrelated to the text, the visual information may not enrich the textual modality when learning aspect-based sentiment features. In such cases, blindly leveraging information from the visual modality may introduce noise when reasoning about aspect-based sentiment expressions. To tackle these shortcomings, we propose an end-to-end MABSA framework with image conversion and noise filtration. Specifically, to bridge the representation gap between modalities, we translate images into the input space of a pre-trained language model (PLM). To this end, we develop an image-to-text conversion module that converts an image into an implicit sequence of token embeddings. Moreover, we devise an aspect-oriented filtration module, consisting of two attention operations, to alleviate the noise in the implicit token embeddings. The first operation creates an enhanced aspect embedding as a query, and the second uses this query to retrieve relevant auxiliary information from the implicit token embeddings to supplement the textual content. After filtering the noise, we leverage a PLM to encode the text, the aspect, and an image prompt derived from the filtered implicit token embeddings as sentiment features to perform aspect-based sentiment prediction. Experimental results on two MABSA datasets show that our framework achieves state-of-the-art performance. Furthermore, extensive experimental analysis demonstrates that the proposed framework has superior robustness and efficiency.
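
The two attention operations of the aspect-oriented filtration module can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say what the aspect embedding attends to in the first operation, so the sketch assumes it attends to the text token embeddings, and it omits all learned projections. The names `attend` and `aspect_oriented_filter` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def aspect_oriented_filter(aspect, text_tokens, image_tokens):
    # Operation 1 (assumption: the aspect is enhanced with text context).
    context, _ = attend(aspect, text_tokens, text_tokens)
    query = [a + c for a, c in zip(aspect, context)]
    # Operation 2: use the enhanced aspect as a query over the implicit
    # image token embeddings; low weights mark tokens treated as noise.
    retrieved, weights = attend(query, image_tokens, image_tokens)
    return retrieved, weights


# Toy 2-dimensional embeddings, chosen only for illustration.
aspect = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[2.0, 0.0], [0.0, 2.0]]

retrieved, weights = aspect_oriented_filter(aspect, text_tokens, image_tokens)
print(weights)
```

In this toy setup the first image token points in the same direction as the aspect, so it receives the larger attention weight, while the unrelated second token is down-weighted rather than discarded.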

Multi-granularity Visual-Textual Jointly Modeling for Aspect-Level Multimodal Sentiment Analysis

Multi-Grained Fusion Network with Self-Distillation for Aspect-Based Multimodal Sentiment Analysis

MFSC: A Multimodal Aspect-Level Sentiment Classification Framework with Multi-Image Gate and Fusion Networks

An Interactive Attention Mechanism Fusion Network for Aspect-Based Multimodal Sentiment Analysis

Interactive Fusion Network with Recurrent Attention for Multimodal Aspect-based Sentiment Analysis

MIECF: Multi-faceted information extraction and cross-mixture fusion for multimodal aspect-based sentiment analysis

Aspect-level multimodal sentiment analysis based on co-attention fusion

Multifeature Interactive Fusion Model for Aspect-Based Sentiment Analysis

Multi-grained Attention Network for Aspect-Level Sentiment Classification

Joint Multi-modal Aspect-Sentiment Analysis with Auxiliary Cross-modal Relation Detection

Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

Self-adaptive attention fusion for multimodal aspect-based sentiment analysis

Multi-Interactive Memory Network for Aspect Based Multimodal Sentiment Analysis

Image and Text Aspect Level Multimodal Sentiment Classification Model Using Transformer and Multilayer Attention Interaction

Visual Enhancement Capsule Network for Aspect-based Multimodal Sentiment Analysis

Multi-selection Attention for Multimodal Aspect-level Sentiment Classification

Affective Knowledge Enhanced Multiple-Graph Fusion Networks for Aspect-based Sentiment Analysis

Co-attention Guided Local-Global Feature Fusion for Aspect-Level Multimodal Sentiment Analysis

Image-to-Text Conversion and Aspect-Oriented Filtration for Multimodal Aspect-Based Sentiment Analysis

AMIFN: Aspect-guided Multi-view Interactions and Fusion Network for Multimodal Aspect-based Sentiment Analysis