Abstract: Recent years have witnessed a boom in online social media platforms that offer the popular “Time-Sync Comment” service, which lets viewers share opinions synchronized with the video content. On these platforms, numerous semantically altered terms, or “memes”, have been created by niche users to express their unique ideas and emotions, and these memes have in turn attracted large groups of active, enthusiastic viewers. Unfortunately, because memes are built on domain-specific knowledge and their meaning varies with the multimodal context of the video, newcomers may fail to grasp their semantic connotations, which can severely impair the user experience. To address this issue, in this article we propose a novel meme explanation framework, called ProMDE, that automatically captures and comprehends the memes in time-sync comments and can thereby provide viewers with a meme explanation service. Specifically, we first iteratively reconstruct the original time-sync comments, in comparison with the visual embedding, to detect semantically altered terms as meme candidates. Afterward, guided by a domain-specific corpus, visual and textual features are fused to represent context-aware multimodal cues. Moreover, to accurately describe the homophones commonly seen in memes, i.e., terms with the same pronunciation but different spellings, we integrate phonetic symbols as an additional modality to enhance the framework. Finally, we use a Transformer-based decoder to generate natural language explanations for the captured memes. Extensive experiments on a large real-world dataset show that our framework significantly outperforms several state-of-the-art baseline methods, demonstrating the efficacy of modeling multimodal context and pronunciation for meme detection and explanation.
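To make the homophone issue concrete, the following is a minimal sketch of how phonetic symbols can link terms that sound alike but are spelled differently. It is not the ProMDE implementation: the lexicon `TOY_PHONETICS`, the function `group_homophones`, and the English example words are all hypothetical and chosen only for illustration, and a real system would presumably learn phonetic embeddings rather than match exact keys.

```python
from collections import defaultdict

# Hypothetical toy lexicon (an assumption, not from the paper):
# surface form -> phonetic symbols. A real system would take these
# from a pronunciation dictionary or a grapheme-to-phoneme tool.
TOY_PHONETICS = {
    "night": "n ay t",
    "knight": "n ay t",
    "nite": "n ay t",
    "right": "r ay t",
    "write": "r ay t",
    "video": "v ih d iy ow",
}


def group_homophones(terms, phonetics):
    """Group terms that share a phonetic key but differ in spelling."""
    groups = defaultdict(set)
    for term in terms:
        key = phonetics.get(term)
        if key is not None:  # skip terms whose pronunciation is unknown
            groups[key].add(term)
    # Keep only pronunciations that map to at least two distinct spellings.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


homophones = group_homophones(
    ["nite", "night", "write", "right", "video", "lol"], TOY_PHONETICS
)
print(homophones)
```

Terms that share a phonetic key end up in the same group, which is the kind of signal a text-only representation would miss when a meme replaces a word with a same-sounding spelling.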

CEFM: CLIP Encoded Fusion Model for multimodal humor recognition on memes

Emotion-Aware Multimodal Fusion for Meme Emotion Detection

Multimodal Cross-Lingual Features and Weight Fusion for Cross-Cultural Humor Detection

Meme Sentiment Analysis Enhanced with Multimodal Spatial Encoding and Facial Embedding

MemeFier: Dual-stage Modality Fusion for Image Meme Classification

Multimodal sentiment analysis of English and Hinglish memes

Hateful Memes Detection via Complementary Visual and Linguistic Networks

Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes

Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark

A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

Research on Image-text Multimodal Emotions Analysis with Fused Emoji

XMeCap: Meme Caption Generation with Sub-Image Adaptability

VIEMF: Multimodal metaphor detection via visual information enhancement with multimodal fusion

A Multimodal Framework for the Detection of Hateful Memes

Exercise? I thought you said 'Extra Fries': Leveraging Sentence Demarcations and Multi-hop Attention for Meme Affect Analysis

Comprehending the Gossips: Meme Explanation in Time-Sync Video Comment via Multimodal Cues

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization