Novelty fused image and text models based on deep neural network and transformer for multimodal sentiment analysis

Bui Thanh Hung, Nguyen Hoang Minh Thu
DOI: https://doi.org/10.1007/s11042-023-18105-8
IF: 2.577
2024-01-23
Multimedia Tools and Applications
Abstract: The rapid growth of online platforms has made it easier than ever for people to share their feelings and opinions as both text and images on social networks. Data shared online therefore often carries clear sentiment, which makes it a rich resource for multimodal sentiment analysis. Compared with traditional single-modality sentiment analysis, the complementarity and interaction between multiple modalities provide a more comprehensive analysis of the user's sentiment. In this paper, we propose a multimodal sentiment analysis model that fuses text and image by combining feature-level and decision-level fusion strategies. To this end, we use Bidirectional Encoder Representations from Transformers (BERT) to obtain semantic, context-aware features of the text data. In parallel, a deep convolutional neural network, DenseNet201, is used to obtain the representation of the visual data. These features are fed to our three proposed multimodal data fusion models for sentiment analysis: intermediate fusion, late fusion, and hybrid fusion. Furthermore, we run experiments with several other deep neural networks to extract high-level visual features that complement the BERT textual features; finally, we compare different data fusion strategies to gain a better understanding of our proposed models and to demonstrate their effectiveness.
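The three fusion strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors stand in for BERT and DenseNet201 outputs, the toy dimensions (8 and 6), the three sentiment classes, the untrained random linear heads, and every function name are assumptions made for illustration. In particular, treating hybrid fusion as an average of the intermediate and late predictions is an assumption; the abstract only says that feature-level and decision-level fusion are combined.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """One score per class: dot(row, x) + b for each weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_head(n_in, n_classes, rng):
    """Hypothetical untrained classifier head (random weights, zero bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_classes)]
    return weights, [0.0] * n_classes


def intermediate_fusion(text_feat, image_feat, joint_head):
    """Feature-level fusion: concatenate both feature vectors, classify once."""
    fused = text_feat + image_feat  # list concatenation
    return softmax(linear(*joint_head, fused))


def late_fusion(text_feat, image_feat, text_head, image_head, text_weight=0.5):
    """Decision-level fusion: classify each modality, then mix the predictions."""
    p_text = softmax(linear(*text_head, text_feat))
    p_image = softmax(linear(*image_head, image_feat))
    return [text_weight * a + (1.0 - text_weight) * b
            for a, b in zip(p_text, p_image)]


def hybrid_fusion(text_feat, image_feat, joint_head, text_head, image_head):
    """Assumed hybrid: average the feature-level and decision-level outputs."""
    p_joint = intermediate_fusion(text_feat, image_feat, joint_head)
    p_late = late_fusion(text_feat, image_feat, text_head, image_head)
    return [(a + b) / 2.0 for a, b in zip(p_joint, p_late)]


rng = random.Random(0)
N_CLASSES = 3                                    # assumed: negative/neutral/positive
text_feat = [rng.random() for _ in range(8)]     # placeholder for a BERT text vector
image_feat = [rng.random() for _ in range(6)]    # placeholder for a DenseNet201 vector

text_head = make_head(8, N_CLASSES, rng)
image_head = make_head(6, N_CLASSES, rng)
joint_head = make_head(8 + 6, N_CLASSES, rng)

p_intermediate = intermediate_fusion(text_feat, image_feat, joint_head)
p_late = late_fusion(text_feat, image_feat, text_head, image_head)
p_hybrid = hybrid_fusion(text_feat, image_feat, joint_head, text_head, image_head)
```

Each function returns a probability distribution over the assumed classes. In a real pipeline the placeholder vectors would be replaced by encoder outputs and the heads would be trained rather than randomly initialized.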
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering