MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
2024-10-28
Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
Machine Learning, Computation and Language, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the difficulty of machine understanding of text-embedded images ("memes"). Specifically, it focuses on classifying memes related to the LGBTQ+ Pride movement along four aspects:

1. **Detection of Hate Speech**
2. **Classification of the Targets of Hate Speech**
3. **Classification of Topical Stance**
4. **Detection of Intended Humor**

These tasks are hard because a meme carries both visual information and embedded text, and its meaning is often highly context-dependent and subjective. As a result, existing unimodal and multimodal analysis methods often struggle to understand and classify such images comprehensively and accurately.

### Background and Motivation

With the proliferation of social media platforms, the generation and dissemination of multimedia content have grown exponentially. In this digital ecosystem, memes, which combine humor, wit, and a rebellious edge with text-embedded images, have become an important medium for people to express opinions, share experiences, and participate in online activities. However, this freedom of expression has also led to widespread hate speech targeting individuals, organizations, and marginalized communities.

The LGBTQ+ movement is a prominent topic of online discussion, and memes about it serve both as tools of solidarity and support and as means of resistance and satire. In this context, distinguishing humor from harm is difficult, because memes often tread the line between satire and offense. This poses significant challenges for researchers and platforms. Moreover, previous attempts to suppress hateful content have often resulted in discriminatory suppression of all LGBTQ+ content, which can harm awareness and acceptance of this community.

Understanding hate speech, stance, and intended humor in memes is therefore crucial for creating an inclusive digital environment and combating online discrimination.

### Solution

To address these challenges, the paper introduces a new dataset, **PrideMM**, which contains 5,063 text-embedded images related to the LGBTQ+ Pride movement, each annotated with labels for the four tasks. With this dataset, the authors aim to foster a deeper understanding of how people interact through memes on social media and to promote the development of multimodal content moderation methods that make the internet a safer space.

The paper also proposes a new framework, **MemeCLIP**, which preserves the knowledge of the pre-trained CLIP model and achieves efficient downstream learning through several lightweight modules. MemeCLIP outperforms previously proposed frameworks on two real-world datasets and is also compared with zero-shot GPT-4 on the hate classification task.

### Main Contributions

1. **Release of the PrideMM Dataset**: Contains 5,063 text-embedded images related to the LGBTQ+ Pride movement.
2. **Benchmarking**: Establishes benchmarks for each PrideMM task using a range of unimodal and multimodal baseline methods.
3. **Proposed MemeCLIP Framework**: Uses the frozen CLIP encoders together with lightweight modules for multimodal, multi-aspect classification.

Through these contributions, the paper provides a dataset and a method for the multi-aspect classification of text-embedded images, supporting better understanding and moderation of content on social media.
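
The general pattern behind the framework described above, a frozen pre-trained encoder with only a small module trained on top, can be illustrated with a toy sketch. This is not MemeCLIP's architecture or the authors' code: the "encoder" is a fixed random projection standing in for CLIP, the lightweight module is a plain linear head, and the data, dimensions, learning rate, and every function name are assumptions made for illustration.

```python
import math
import random

random.seed(0)
IN_DIM, FEAT_DIM, NUM_CLASSES = 4, 8, 2  # toy sizes, chosen arbitrarily

# Stand-in for a frozen pre-trained encoder (hypothetical; NOT CLIP itself).
FROZEN_W = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]

def encode(x):
    """Frozen encoder: fixed projection followed by L2 normalisation."""
    feats = [sum(w * v for w, v in zip(row, x)) for row in FROZEN_W]
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(head, feats):
    """Class probabilities from the trainable linear head."""
    return softmax([sum(w * f for w, f in zip(row, feats)) for row in head])

def mean_loss(head, data):
    """Average cross-entropy over the data set."""
    return -sum(math.log(predict(head, encode(x))[y]) for x, y in data) / len(data)

def train_step(head, data, lr=0.5):
    """One full-batch gradient step that updates the head only."""
    grad = [[0.0] * FEAT_DIM for _ in range(NUM_CLASSES)]
    for x, y in data:
        feats = encode(x)
        probs = predict(head, feats)
        for c in range(NUM_CLASSES):
            err = probs[c] - (1.0 if c == y else 0.0)
            for j in range(FEAT_DIM):
                grad[c][j] += err * feats[j] / len(data)
    for c in range(NUM_CLASSES):
        for j in range(FEAT_DIM):
            head[c][j] -= lr * grad[c][j]

# Synthetic data: random vectors with an arbitrary binary label.
data = []
for _ in range(32):
    x = [random.random() for _ in range(IN_DIM)]
    data.append((x, 1 if x[0] > 0.5 else 0))

head = [[0.0] * FEAT_DIM for _ in range(NUM_CLASSES)]  # the only trainable weights
frozen_snapshot = [row[:] for row in FROZEN_W]          # to confirm nothing leaks
initial_loss = mean_loss(head, data)
for _ in range(200):
    train_step(head, data)
final_loss = mean_loss(head, data)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because the encoder weights are never touched, whatever the encoder already represents is preserved, and only `FEAT_DIM * NUM_CLASSES` parameters are learned. That is the efficiency argument the summary attributes to MemeCLIP, shown here in its simplest possible form.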