Abstract: Social media has become an important means for people to share their daily lives and express emotions. Users are no longer limited to text-only content and increasingly convey information through multiple modalities. However, existing multimodal sentiment analysis methods face several challenges.

First, most current methods judge sentiment polarity primarily from textual content and struggle to associate the emotional features shared by text and images. For instance, the features extracted for a given image remain the same regardless of the text fed into the model. As a result, the models fail to extract relevant features from the text to aid sentiment analysis.

Second, social media data commonly exhibit inconsistencies between the textual content and what the image depicts, making it difficult to obtain visually sensitive textual representations.

Third, most existing methods use ResNet (deep residual network) to extract image features and fuse them directly with the text, lacking the ability to adaptively select and focus on key information in the input image.

To address these challenges in multimodal sentiment analysis, we make the following improvements:
1. Drawing on the target-oriented multimodal BERT (TomBERT) model, which focuses on multimodal aspect-level sentiment analysis, we design a text-image matching module tailored to the experimental needs of this study. This module serves as the primary component applied in multimodal sentiment analysis.
2. We incorporate the Image Captioning with Transformers (CATR) module to convert images into textual representations. This enriches the information available from the text data and resolves the mismatch between textual content and image descriptions.
3. We integrate an improved Convolutional Block Attention Module (CBAM) to adaptively select and concentrate on essential image features.

These improvements are combined in a model named Image Captioning Joint Attention Mechanism (ICAM).
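To make the attention component concrete, the following is a minimal sketch of the standard CBAM forward pass (channel attention followed by spatial attention) in plain Python. It is not the improved variant used in ICAM, whose details the abstract does not give. The function names, the tiny feature-map sizes, and the weights are illustrative assumptions; in practice the weights are learned, and the original CBAM uses a 7x7 spatial kernel, whereas this sketch accepts any odd kernel size.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap, w1, w2):
    """Return one weight in (0, 1) per channel.

    fmap: list of C channels, each an H x W nested list.
    w1:   (C // r) x C weights of the shared MLP's hidden layer (assumed given).
    w2:   C x (C // r) weights of the shared MLP's output layer (assumed given).
    """
    def mlp(vec):
        # Hidden layer with ReLU, then a linear output layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # Global average and max pooling over the spatial dimensions.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]
    # The same MLP processes both pooled vectors; the outputs are summed.
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, kernel):
    """Return an H x W map of weights in (0, 1).

    kernel: 2 x k x k convolution weights (k odd), applied with zero padding
    to the stacked channel-wise average and maximum maps.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(fmap[c][i][j] for c in range(C)) / C for j in range(W)]
           for i in range(H)]
    mx = [[max(fmap[c][i][j] for c in range(C)) for j in range(W)]
          for i in range(H)]
    pooled = [avg, mx]
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding
                            acc += kernel[p][di][dj] * pooled[p][y][x]
            out[i][j] = sigmoid(acc)
    return out


def cbam(fmap, w1, w2, kernel):
    """Rescale fmap by channel attention, then by spatial attention."""
    ca = channel_attention(fmap, w1, w2)
    refined = [[[ca[c] * v for v in row] for row in ch]
               for c, ch in enumerate(fmap)]
    sa = spatial_attention(refined, kernel)
    return [[[sa[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)]
            for ch in refined]
```

Because both attention maps lie in (0, 1), the block can only attenuate activations: channels and locations the maps score low are suppressed, which is the "adaptive selection" behavior the third improvement refers to. In a real model this block would sit on top of the image encoder's feature maps and be trained end to end.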
