Multimodal Sentiment Analysis Based on Image Captioning and Attention Mechanism

Ye Sun, Guozhe Jin, Yahui Zhao, Rongyi Cui
DOI: https://doi.org/10.1109/iccasit58768.2023.10351606
2023-01-01
Abstract: Social media has become an important means for people to share their daily lives and express emotions. In this environment, individuals are no longer limited to text-only content and increasingly convey information through multiple modalities. However, existing multimodal sentiment analysis methods face several challenges.
First, most current methods judge sentiment polarity primarily from textual content and struggle to associate the emotional features shared by text and images. For instance, the image features extracted for a given image remain the same regardless of the text fed into the model, so the models fail to use the text to extract the features relevant to sentiment analysis. Second, social media data commonly exhibit inconsistencies between the textual content and what the image depicts, making it difficult to obtain visually sensitive textual representations. Third, most existing methods extract image features with ResNet (deep residual network) and fuse them directly with the text, lacking the ability to adaptively select and focus on key information in the input image.
To address these challenges, the following improvements have been made:
1. Drawing on the target-oriented multimodal BERT (TomBERT) model, which focuses on multimodal aspect-level sentiment analysis, we designed a text-image matching module tailored to the experimental needs of this study. This module serves as the primary component of our multimodal sentiment analysis approach.
2. We incorporated the Image Captioning with Transformers (CATR) module to convert images into textual representations. This enriches the information available from the text data and addresses the mismatch between textual content and image descriptions.
3. We integrated an improved Convolutional Block Attention Module (CBAM) to introduce the capability of adaptively selecting and concentrating on essential image features.
Based on these improvements, we designed a model named Image Captioning Joint Attention Mechanism (ICAM).