Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

Hongchan Li, Yantong Lu, Haodong Zhu

DOI: https://doi.org/10.3390/electronics13112069

IF: 2.9

2024-05-28

Electronics

Abstract: Research on uni-modal sentiment analysis has achieved great success, but emotions in real life are mostly expressed multi-modally: not only through text but also through images, audio, video, and other forms. The modalities reinforce one another, so mining the connections between them can further improve the accuracy of sentiment analysis. To this end, this paper introduces MCAM, a cross-attention-based multi-modal fusion model for images and text. For text, we use the ALBert pre-trained model to extract text features and then a BiLSTM to extract text context features. For images, we use DenseNet121 to extract image features and then CBAM to focus on the image regions related to emotion. Finally, we use multi-modal cross-attention to fuse the extracted text and image features, and we classify the output to determine the emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, the proposed model outperforms the baseline models, with accuracy and F1 scores reaching 86.5% and 75.3%, and 85.5% and 76.7%, respectively. Ablation experiments further confirm that sentiment analysis with multi-modal fusion outperforms single-modal sentiment analysis.
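For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and an F1 score on a toy set of predictions. It is illustrative only: the labels and predictions are made up, and macro-averaging of the per-class F1 is an assumption, since the abstract does not state which averaging the authors used.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical three-class polarity labels, not data from the paper.
y_true = ["pos", "pos", "neu", "neg", "neg", "neg"]
y_pred = ["pos", "neu", "neu", "neg", "neg", "pos"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

Accuracy and F1 can diverge noticeably when classes are imbalanced, which is one reason papers in this area typically report both.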

Engineering, Electrical & Electronic; Physics, Applied; Computer Science, Information Systems

What problem does this paper attempt to address?

The paper addresses multimodal sentiment analysis. The researchers observe that real-life information is often multimodal (for example, text paired with images) rather than single-modal. Existing single-modal sentiment analysis methods have been very successful, but they cannot exploit the connections between modalities. The paper therefore proposes MCAM, a fusion model based on a multimodal cross-attention mechanism, for sentiment analysis of images and text. On the text side, the model uses the ALBert pre-trained model to extract text features and a BiLSTM to extract contextual features. On the image side, it extracts features with DenseNet121 and uses the CBAM mechanism to pick out the image regions most relevant to sentiment. Finally, a multimodal cross-attention mechanism fuses the extracted text and image sentiment features, and a classifier outputs the prediction. Experimental results show that the model outperforms baseline models on the public MVSA and TumEmo datasets, with accuracy and F1 scores reaching 86.5% and 75.3%, and 85.5% and 76.7%, respectively. Ablation experiments further confirm that multimodal fusion outperforms single-modal sentiment analysis.
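The fusion step described above can be sketched in a few lines. This is a minimal, dependency-free illustration of bidirectional cross-attention, not the authors' implementation: the learned query/key/value projections are omitted, the feature matrices are random stand-ins for the BiLSTM and CBAM outputs, and the names `cross_attention`, `mean_pool`, and `fuse` as well as the pool-and-concatenate strategy are assumptions made for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.

    Learned Q/K/V projections are left out (a simplification of this sketch).
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))  # shape (n_queries, n_context)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context), weights  # attended features, attention map


def mean_pool(m):
    """Average the rows of a matrix into a single vector."""
    return [sum(col) / len(m) for col in zip(*m)]


def fuse(text_feats, image_feats):
    """Hypothetical fusion: attend in both directions, pool, then concatenate."""
    text_to_image, _ = cross_attention(text_feats, image_feats)
    image_to_text, _ = cross_attention(image_feats, text_feats)
    return mean_pool(text_to_image) + mean_pool(image_to_text)


random.seed(0)
dim = 8
# Random stand-ins: 5 text tokens and 4 image regions, each an 8-dim feature.
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
image_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

fused = fuse(text_feats, image_feats)
print(len(fused))  # 16: two pooled 8-dim vectors concatenated
```

In a full model the fused vector would be passed to a classifier that predicts the sentiment polarity. Attending in both directions lets text tokens weight image regions and image regions weight text tokens, which is the general motivation for cross-attention fusion.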

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism

Image-Text Multimodal Emotion Classification via Multi-View Attentional Network

Multi-Feature Fusion Multi-Modal Sentiment Analysis Model Based on Cross-Attention Mechanism

Multimodal sentiment analysis based on multiple attention

Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Multimodal Sentiment Analysis of Graphic Texts Based on Multicategorical Relative Fusion

Research on cross-modal emotion recognition based on multi-layer semantic fusion

Cross-modality reinforcement for unaligned sequences sentiment analysis

A multimodal sentiment recognition method based on attention mechanism

Multi-modal Sentiment and Emotion Joint Analysis with a Deep Attentive Multi-task Learning Model

Attention-based multi-level image and text sentiment analysis

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Sentiment Analysis Based on Composite Hierarchical Fusion