Abstract: Interactive fusion methods have been successfully applied to multimodal sentiment analysis because they achieve data complementarity through the interaction of different modalities. However, previous methods treat the information of each modality as a whole and usually weight the modalities equally, failing to distinguish the contributions of different semantic regions in non-textual features to textual features. As a result, public regions are not captured, and private regions are hard to predict from textual features alone. Meanwhile, these methods use sentiment-independent encoders to encode textual features, which may mistakenly identify syntactically irrelevant contextual words as clues for predicting sentiment. In this paper, we propose a coordinated-joint translation fusion framework with a sentiment-interactive graph to solve these problems. Specifically, we generate a novel sentiment-interactive graph that incorporates sentiment associations between words into the syntactic adjacency matrix. The relationships between nodes are thus no longer limited to syntactic associations but also account for the sentiment interaction between words. Then, we design a coordinated-joint translation fusion module. This module uses a cross-modal masked attention mechanism to determine whether the text and non-text inputs are correlated, thereby identifying the public semantic features in the visual and acoustic modalities that are most relevant to the text modality. Subsequently, a cross-modal translation-aware mechanism calculates the differences between the text modality and the visual and acoustic features translated into the text modality, which allows us to reconstruct the visual and acoustic modalities towards the text modality and obtain private semantic features.
In addition, we construct a multimodal fusion layer that fuses textual features with non-textual public and private features to improve multimodal interaction. Experimental results on the publicly available CMU-MOSI and CMU-MOSEI datasets show that our proposed model achieves best accuracies of 86.5% and 86.1% and best F1 scores of 86.4% and 86.1%, respectively. A series of further analyses also indicates that the proposed framework effectively improves sentiment identification capability.
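The abstract does not give formulas for the sentiment-interactive graph or the cross-modal masked attention, so the following is only a minimal sketch of how the two ideas could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of per-word lexicon-style sentiment scores, the product of absolute scores as the sentiment association, the score threshold used as the mask, and the requirement that text and non-text features share one dimension.

```python
import math


def sentiment_interactive_adjacency(syntactic_adj, sentiment_scores):
    """Add pairwise sentiment association to a syntactic adjacency matrix.

    Assumption: the association between words i and j is |s_i| * |s_j|,
    so two sentiment-bearing words are linked even without a syntactic edge.
    """
    n = len(sentiment_scores)
    strength = [abs(s) for s in sentiment_scores]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                adj[i][j] = 1.0  # self-loop, as is common for graph convolutions
            else:
                adj[i][j] = float(syntactic_adj[i][j]) + strength[i] * strength[j]
    return adj


def masked_cross_modal_attention(text, other, threshold=0.0):
    """Attend from text tokens to non-text frames, masking uncorrelated pairs.

    Assumption: a pair counts as correlated when its scaled dot-product score
    exceeds `threshold`; masked pairs receive zero attention weight.
    Returns the attended ("public") features and the boolean mask.
    """
    dim = len(text[0])
    public, mask = [], []
    for query in text:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in other]
        keep = [s > threshold for s in scores]
        kept = [s for s, k in zip(scores, keep) if k]
        if kept:
            top = max(kept)  # subtract the max for a numerically stable softmax
            exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
            total = sum(exps)
            weights = [e / total for e in exps]
        else:
            # No correlated frame for this token: it contributes nothing.
            weights = [0.0] * len(other)
        public.append([sum(w * key[d] for w, key in zip(weights, other))
                       for d in range(dim)])
        mask.append(keep)
    return public, mask


# Toy inputs: three words with one syntactic edge, two text tokens, two frames.
adj = sentiment_interactive_adjacency(
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]], [0.5, 0.0, -1.0])
public, mask = masked_cross_modal_attention(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [-1.0, 0.0]])
```

In the toy input, the first and third words have no syntactic edge but both carry sentiment, so they become connected in the combined graph. The second text token has no frame above the threshold, so its public feature is all zeros. In a trained model the mask criterion and the association weights would presumably be learned rather than fixed as they are here.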

Prompt Fusion Interaction Transformer for Aspect-Based Multimodal Sentiment Analysis

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

An Interactive Attention Mechanism Fusion Network for Aspect-Based Multimodal Sentiment Analysis

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Prompt Link Multimodal Fusion in Multimodal Sentiment Analysis

Interactive Fusion Network with Recurrent Attention for Multimodal Aspect-based Sentiment Analysis

Multilayer interactive attention bottleneck transformer for aspect-based multimodal sentiment analysis

Multifeature Interactive Fusion Model for Aspect-Based Sentiment Analysis

Image-text sentiment analysis via deep multimodal attentive fusion

Hierarchical Interactive Multimodal Transformer for Aspect-Based Multimodal Sentiment Analysis

MIECF: Multi-faceted information extraction and cross-mixture fusion for multimodal aspect-based sentiment analysis

Multi-Grained Fusion Network with Self-Distillation for Aspect-Based Multimodal Sentiment Analysis

Dual-Perspective Fusion Network for Aspect-Based Multimodal Sentiment Analysis

Self-adaptive attention fusion for multimodal aspect-based sentiment analysis

MSFNet: modality smoothing fusion network for multimodal aspect-based sentiment analysis

Coordinated-joint Translation Fusion Framework with Sentiment-Interactive Graph Convolutional Networks for Multimodal Sentiment Analysis

Image-to-Text Conversion and Aspect-Oriented Filtration for Multimodal Aspect-Based Sentiment Analysis

Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts

Hierarchical Fusion Network with Enhanced Knowledge and Contrastive Learning for Multimodal Aspect-Based Sentiment Analysis on Social Media

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Multimodal Transformer with Adaptive Modality Weighting for Multimodal Sentiment Analysis