Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Huiru Wang, Xiuhong Li, Zenyu Ren, Dan Yang, Chunming Ma
DOI: https://doi.org/10.48550/arXiv.2303.14708
2023-03-26
Abstract: Because multimodal data carries information from more than one modality, multimodal sentiment analysis has become a recent research hotspot. However, redundant information is easily introduced when features are fused after extraction, which degrades the fused feature representation. Therefore, in this paper, we propose a new multimodal sentiment analysis model. In our model, we use BERT + BiLSTM as a new feature extractor to capture long-distance dependencies in sentences and to take the position information of input sequences into account, yielding richer text features. To remove redundant information and make the network pay more attention to the correlation between image and text features, CNN and CBAM attention are added after splicing the text features and image features, improving the feature representation ability. On the MVSA-single dataset and HFM dataset, compared with the baseline model, the ACC of our model is improved by 1.78% and 1.91%, and the F1 value by 3.09% and 2.0%, respectively. The experimental results show that our model performs well, at a level similar to advanced models.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses the interference of redundant information during feature fusion in multimodal sentiment analysis. Specifically, when features are extracted from text and images and then fused, redundant information may be introduced, which weakens the fused feature representation and, in turn, the accuracy of sentiment analysis. To solve this problem, the paper proposes a new multimodal sentiment analysis model. It uses BERT and BiLSTM as feature extractors to capture long-distance dependencies in the text and to take the position information of the input sequence into account, which yields richer text features. To remove redundant information and make the network pay more attention to the correlation between image and text features, the model adds CNN and CBAM attention mechanisms after splicing the text features and image features, improving the feature representation ability. Experimental results show that, compared with the baseline model, the accuracy (ACC) of this model on the MVSA-single dataset and the HFM dataset increases by 1.78% and 1.91% respectively, and the F1 value increases by 3.09% and 2.0% respectively. These improvements indicate that the proposed model performs well on the multimodal sentiment analysis task.
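The fusion step described above can be illustrated with a minimal NumPy sketch of CBAM-style attention applied to spliced features. This is an illustration under stated assumptions, not the authors' implementation: splicing is read as concatenation along the sequence axis, the feature shapes and random weights are hypothetical stand-ins for the BERT + BiLSTM and image encoders, a 1-D variant of CBAM is used, and the CNN layer the paper places before CBAM is omitted for brevity. The functions `channel_attention`, `spatial_attention`, and `cbam` are names chosen here for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1) for a feature map of shape (C, L).

    A shared two-layer MLP (C -> C//r -> C) is applied to the average-pooled
    and max-pooled channel descriptors, and the two outputs are summed.
    """
    avg = feat.mean(axis=1)  # (C,)
    mx = feat.max(axis=1)    # (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(feat, kernel):
    """Per-position weights in (0, 1) for a feature map of shape (C, L).

    Average and max over channels are stacked into a (2, L) map and passed
    through a zero-padded 1-D convolution with an odd kernel width.
    """
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, L)
    k = kernel.shape[1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad)))
    length = feat.shape[1]
    scores = np.array(
        [np.sum(padded[:, i:i + k] * kernel) for i in range(length)]
    )
    return sigmoid(scores)


def cbam(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    feat = feat * channel_attention(feat, w1, w2)[:, None]
    feat = feat * spatial_attention(feat, kernel)[None, :]
    return feat


# Hypothetical sizes: C channels, reduction ratio r, kernel width k.
rng = np.random.default_rng(0)
C, r, k = 8, 2, 7
text_feat = rng.standard_normal((C, 5))   # stand-in for BERT + BiLSTM output
image_feat = rng.standard_normal((C, 4))  # stand-in for image features

# Assumption: "splicing" is concatenation along the sequence axis.
fused = np.concatenate([text_feat, image_feat], axis=1)  # (C, 9)

# Random weights stand in for parameters that would be learned in training.
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k)) * 0.1

refined = cbam(fused, w1, w2, kernel)
print(refined.shape)  # (8, 9)
```

Because both attention maps lie strictly between 0 and 1, every fused feature is scaled down rather than amplified; in a trained model the learned weights would decide which channels and positions are kept close to their original magnitude and which are suppressed as redundant.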