Research on Multimodal Sentiment Classification of Internet Memes Based on Transformer

Shengnan Chi, Guoming Sang, Xian Shi
DOI: https://doi.org/10.1145/3673277.3673354
2024-01-01
Abstract: In the past few years, internet memes have emerged as one of the most widely shared forms of content on social media platforms. People use memes to express their emotional states, whether by sharing opinions, conveying viewpoints, or showing attitudes. However, traditional methods for sentiment analysis of memes feed image and text features directly into fully connected layers and a classification layer with softmax activation. Directly concatenating the extracted features in this way overlooks the global context and semantic information in images, which degrades sentiment analysis performance. To address these issues, this paper proposes a multimodal sentiment analysis framework named BERES. The framework uses CRNN+CTC to extract the text embedded in memes, and uses the BERT language model and ResNet50 to learn the text and visual features of meme images, respectively. To strengthen the model's representation of the input data, we introduce a Transformer-based visual enhancement module. The text features and image sequence features are then concatenated and fed into a fusion layer consisting of six Transformer encoder layers to achieve a deeper fusion of text and image features. Extensive experiments on publicly available datasets demonstrate that the proposed model outperforms existing multimodal models.
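The fusion stage described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name FusionSketch, the shared width of 768, the eight attention heads, the single-layer visual enhancement module, the mean pooling, and the three output classes are all assumptions made for illustration. Random tensors stand in for BERT token embeddings and for a flattened 7x7 ResNet50 feature map, so no pretrained weights or OCR step are involved. Only the concatenation of text and image sequences and the six-layer Transformer encoder come from the abstract.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical sketch of the concatenate-then-fuse stage (not the paper's code)."""

    def __init__(self, text_dim=768, image_dim=2048, d_model=768,
                 n_heads=8, n_fusion_layers=6, n_classes=3):
        super().__init__()
        # Project both modalities to a shared width (assumed design choice).
        self.text_proj = nn.Linear(text_dim, d_model)
        self.image_proj = nn.Linear(image_dim, d_model)
        # Visual enhancement: self-attention over image regions.
        # The depth of one layer is an assumption; the abstract does not state it.
        enhance_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.visual_enhance = nn.TransformerEncoder(enhance_layer, num_layers=1)
        # Fusion: six Transformer encoder layers, as stated in the abstract.
        fusion_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=n_fusion_layers)
        # Classification head; the number of classes is assumed.
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, text_len, text_dim)
        # image_feats: (batch, n_regions, image_dim)
        text = self.text_proj(text_feats)
        image = self.visual_enhance(self.image_proj(image_feats))
        # Concatenate along the sequence axis so attention spans both modalities.
        fused = self.fusion(torch.cat([text, image], dim=1))
        # Mean pooling over the joint sequence (assumed pooling strategy).
        return self.classifier(fused.mean(dim=1))


model = FusionSketch().eval()
text_feats = torch.randn(2, 16, 768)     # stand-in for BERT token embeddings
image_feats = torch.randn(2, 49, 2048)   # stand-in for a flattened 7x7 ResNet50 map
with torch.no_grad():
    logits = model(text_feats, image_feats)
print(logits.shape)
```

Concatenating along the sequence axis lets every text token attend to every image region inside the fusion encoder, which is the property that feeding features directly into fully connected layers lacks.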