Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the complex understanding of multimodal images (especially images with text, i.e., "memes") in machine learning. Specifically, the paper focuses on how to classify memes related to the LGBTQ+ pride movement in multiple aspects, including:

1. **Detection of Hate Speech**
2. **Classifying the Targets of Hate Speech**
3. **Classification of Topical Stance**
4. **Detection of Intended Humor**

The complexity of these issues lies in the fact that multimodal images contain not only visual information but also embedded textual information, which often has a high degree of contextual dependency and subjectivity. Therefore, existing unimodal or multimodal analysis methods often struggle to comprehensively and accurately understand and classify these images.

### Background and Motivation

With the proliferation of social media platforms, the generation and dissemination of multimedia content have grown exponentially. In this digital ecosystem, memes, which combine humor, wit, and a rebellious edge with text-embedded images, have become an important medium for people to express opinions, share experiences, and participate in online activities. However, this freedom of expression has also led to the widespread issue of hate speech, particularly targeting individuals, organizations, and marginalized communities.

As an important topic of online discussion, the LGBTQ+ movement sees memes used both as tools of solidarity and support, as well as means of resistance and satire. In this context, distinguishing between humor and harm becomes very challenging, as memes often tread the line between satire and offense, posing significant challenges for researchers and platforms. Moreover, previous attempts to suppress such content often result in discriminatory suppression of all LGBTQ+ content, which can harm the awareness and acceptance of this community. Therefore, understanding hate speech, viewpoints, and humor intentions in memes is crucial for creating an inclusive digital environment and combating online discrimination.

### Solution

To address these challenges, the paper introduces a new dataset **PrideMM**, which contains 5,063 text-embedded images related to the LGBTQ+ movement, annotated with multifaceted labels for four tasks. Through this dataset, the authors aim to foster a deeper understanding of interactions through memes on social media and promote the development of multimodal content moderation methods to make the internet a safer space.

Additionally, the paper proposes a new framework **MemeCLIP**, which leverages the knowledge of the pre-trained CLIP model and achieves efficient downstream learning through multiple lightweight modules. MemeCLIP demonstrates superior performance over previously proposed frameworks on two real-world datasets and is compared with GPT-4 in a zero-shot setting.

### Main Contributions

1. **Release of the PrideMM Dataset**: Contains 5,063 text-embedded images related to the LGBTQ+ movement.
2. **Benchmarking**: Conducted benchmarking of PrideMM using various unimodal and multimodal methods.
3. **Proposed MemeCLIP Framework**: Utilizes the frozen encoder of the CLIP model and lightweight modules for multimodal and multifaceted classification.

Through these contributions, the paper provides new methods and tools for the multifaceted classification of multimodal images, aiding in better understanding and managing content on social media.
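
The summary says only that MemeCLIP pairs a frozen CLIP encoder with lightweight modules across four tasks. The sketch below illustrates that general pattern in plain Python and is not the authors' implementation. Everything specific in it is an assumption: the `frozen_encoder` stand-in (no real CLIP model is loaded), the residual blend controlled by `alpha`, the `LightweightClassifier` name, and the class counts in `TASK_CLASSES`.

```python
import random
from typing import Dict, List

# Hypothetical task names and class counts: the text lists the four tasks
# but does not state their label sets.
TASK_CLASSES: Dict[str, int] = {"hate": 2, "target": 4, "stance": 3, "humor": 2}


def frozen_encoder(meme_id: int, dim: int = 8) -> List[float]:
    """Stand-in for a frozen pre-trained encoder.

    It is deterministic and has no trainable state, mimicking a backbone
    whose weights are never updated. A real system would return CLIP
    image/text features here.
    """
    rng = random.Random(meme_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


class LinearLayer:
    """A small linear map; the only kind of trainable part in this sketch."""

    def __init__(self, in_dim: int, out_dim: int, rng: random.Random) -> None:
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]

    def __call__(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


class LightweightClassifier:
    """Adapter plus classification head on top of frozen features (assumed design)."""

    def __init__(self, dim: int, num_classes: int, alpha: float = 0.2, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.alpha = alpha  # share of the adapted feature in the blend
        self.adapter = LinearLayer(dim, dim, rng)
        self.head = LinearLayer(dim, num_classes, rng)

    def __call__(self, features: List[float]) -> int:
        adapted = self.adapter(features)
        # Residual blend: most of the frozen feature passes through unchanged,
        # which is one simple way to preserve pre-trained knowledge.
        blended = [
            self.alpha * a + (1.0 - self.alpha) * f
            for a, f in zip(adapted, features)
        ]
        logits = self.head(blended)
        # Return the index of the highest-scoring class.
        return max(range(len(logits)), key=logits.__getitem__)


def classify_all_tasks(meme_id: int, dim: int = 8) -> Dict[str, int]:
    """Encode once with the frozen backbone, then run one small module per task."""
    features = frozen_encoder(meme_id, dim)
    return {
        task: LightweightClassifier(dim, n_classes, seed=i)(features)
        for i, (task, n_classes) in enumerate(TASK_CLASSES.items())
    }


if __name__ == "__main__":
    print(classify_all_tasks(meme_id=1))
```

The modules here are untrained, so the predicted indices carry no meaning. The point is the division of labor: the expensive encoder runs once and stays fixed, while each task adds only a small set of parameters that would be trained downstream.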

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

Multimodal Hate Speech Detection in Memes Using Contrastive Language-Image Pre-Training

MemeFier: Dual-stage Modality Fusion for Image Meme Classification

Improving Hateful Meme Detection through Retrieval-Guided Contrastive Learning

Meme Sentiment Analysis Enhanced with Multimodal Spatial Encoding and Facial Embedding

Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations

CEFM: CLIP Encoded Fusion Model for multimodal humor recognition on memes

A Multimodal Framework for the Detection of Hateful Memes

Multimodal Multilabel Classification by CLIP

MIMIC: Misogyny Identification in Multimodal Internet Content in Hindi-English Code-Mixed Language

M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought

On the Evolution of (Hateful) Memes by Means of Multimodal Contrastive Learning

Hateful Memes Detection via Complementary Visual and Linguistic Networks

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Multimodal sentiment analysis of english and hinglish memes

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

MATK: The Meme Analytical Tool Kit

MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization

Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes

A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge