Abstract: Social media has become an important means for people to share their daily lives and express emotions. Users are no longer limited to text-only posts and increasingly convey information through several modalities at once. However, existing multimodal sentiment analysis methods face several challenges.

First, most current methods judge sentiment polarity primarily from textual content and struggle to associate the emotional features shared by text and images. For instance, the image features extracted for a given image remain the same regardless of the text that accompanies it, so the model cannot use the text to guide which image features are relevant to sentiment.

Second, in social media data the textual content is often inconsistent with what the image depicts, making it difficult to obtain visually sensitive textual representations.

Third, most existing methods extract image features with ResNet (deep residual network) and fuse them directly with the text, so they lack the ability to adaptively select and focus on key information in the input image.

To address these challenges, the following improvements have been made:

1. Drawing on the Target-oriented multimodal BERT (TomBERT) model, which focuses on multimodal aspect-level sentiment analysis, we designed a text-image matching module tailored to the experimental needs of this study. This module serves as the primary component of our multimodal sentiment analysis.

2. We incorporated the Image Captioning with Transformers (CATR) module to convert images into textual representations. This enriches the information available from the text data and addresses the mismatch between textual content and image descriptions.

3. We integrated an improved Convolutional Block Attention Module (CBAM) to provide adaptive selection of, and concentration on, essential image features.

Together, these components form the proposed model, named Image Captioning Joint Attention Mechanism (ICAM).
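To illustrate what "adaptive selection and concentration on essential image features" means in practice, the following is a minimal NumPy sketch of the standard CBAM computation (channel attention followed by spatial attention). It is not the improved variant used in ICAM, whose modifications the abstract does not specify. The function names, tensor sizes, reduction ratio, and randomly initialized weights are all illustrative assumptions; in a real model the weights would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Standard CBAM channel attention.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r). Returns weights (C, 1, 1)."""
    avg = feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))     # global max pooling -> (C,)

    def mlp(v):                    # shared two-layer MLP with ReLU
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(feat, kernel):
    """Standard CBAM spatial attention.
    feat: (C, H, W); kernel: (2, k, k) with odd k. Returns weights (1, H, W)."""
    # Pool across channels, then stack the two maps -> (2, H, W)
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = pooled.shape
    out = np.empty((h, w))
    for i in range(h):             # naive k x k convolution, same output size
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]

def cbam(feat, w1, w2, kernel):
    """Reweight channels first, then spatial locations."""
    feat = feat * channel_attention(feat, w1, w2)
    return feat * spatial_attention(feat, kernel)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r, k = 8, 5, 5, 2, 7
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1
refined = cbam(feat, w1, w2, kernel)
```

The refined feature map has the same shape as the input, so a block like this can sit between an image backbone and the fusion stage without changing downstream dimensions. Because both attention maps are sigmoid outputs, each feature value is scaled down in proportion to how unimportant its channel and location are judged to be.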

Multimodal Sentiment Analysis Based on Image Captioning and Attention Mechanism

Attention-based multi-level image and text sentiment analysis

A multimodal sentiment recognition method based on attention mechanism

Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

Image and Text Aspect Level Multimodal Sentiment Classification Model Using Transformer and Multilayer Attention Interaction

Multimodal sentiment analysis based on multiple attention

Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis

Target-oriented Sentiment Classification with Sequential Cross-modal Semantic Graph

Context-Dependent Multimodal Sentiment Analysis Based on a Complex Attention Mechanism

Senti-Attend: Image Captioning using Sentiment and Attention

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal sentiment analysis based on multi-head attention mechanism

Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Cross-modal image sentiment analysis via deep correlation of textual semantic

Multimodal Sentiment Analysis Based on Information Bottleneck and Attention Mechanisms

Bidirectional Complementary Correlation-Based Multimodal Aspect-Level Sentiment Analysis

Multimodal Sentiment Analysis With Image-Text Interaction Network

Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism

Multimodal Sentiment Analysis Based on BERT and ResNet

Multimodal Emotion Classification with Multi-Level Semantic Reasoning Network