InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

Junjie Chen, Hang Yu, Weidong Liu, Subin Huang, Sanmin Liu

2024-08-13

Abstract: The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection methods have been proven to overestimate performance, as they struggle to effectively capture the intricate sarcastic cues that arise from the interaction between an image and text. To address these issues, we propose InterCLIP-MEP, a novel framework for multi-modal sarcasm detection. Specifically, we introduce an Interactive CLIP (InterCLIP) as the backbone to extract text-image representations, enhancing them by embedding cross-modality information directly within each encoder, thereby improving the representations to capture text-image interactions better. Furthermore, an efficient training strategy is designed to adapt InterCLIP for our proposed Memory-Enhanced Predictor (MEP). MEP uses a dynamic, fixed-length dual-channel memory to store historical knowledge of valuable test samples during inference. It then leverages this memory as a non-parametric classifier to derive the final prediction, offering a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, with an accuracy improvement of 1.08% and an F1 score improvement of 1.51% over the previous best method.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses multimodal sarcasm detection in social media. According to the authors, the reported performance of existing multimodal sarcasm detection methods is overestimated, because these methods struggle to capture the sarcastic cues that arise from the interaction between an image and its accompanying text. To tackle this, the authors propose a framework named InterCLIP-MEP with two main components:

1. **Interactive CLIP (InterCLIP)**: The backbone that extracts text-image representations. It embeds cross-modality information directly within each encoder, so that the representations better capture text-image interactions.
2. **Memory-Enhanced Predictor (MEP)**: Uses a dynamic, fixed-length dual-channel memory to store historical knowledge of valuable test samples during inference, then uses that memory as a non-parametric classifier to derive the final prediction, which is intended to make multimodal sarcasm detection more robust.

Experimental results show that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, improving accuracy by 1.08% and F1 score by 1.51% over the previous best method. The authors present this as evidence that the proposed method is effective.
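To make the idea of a fixed-length, dual-channel memory used as a non-parametric classifier more concrete, here is a minimal Python sketch. It is not the authors' implementation. The summary above does not say how "valuable" samples are selected or how the memory is queried, so the sketch assumes, purely for illustration, that samples with lower prediction entropy are more valuable and that the final label comes from mean cosine similarity to each channel. The class name `DualChannelMemory`, its methods, and the toy two-dimensional features are all hypothetical.

```python
import math


class DualChannelMemory:
    """Hypothetical fixed-length memory with one channel per class.

    Channel 0 holds features pseudo-labelled non-sarcastic,
    channel 1 holds features pseudo-labelled sarcastic.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # Each entry is a (entropy, feature) pair.
        self.channels = [[], []]

    @staticmethod
    def entropy(probs):
        # Shannon entropy; lower means a more confident prediction.
        return -sum(p * math.log(p) for p in probs if p > 0)

    @staticmethod
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def update(self, feature, probs):
        """Store a test sample in the channel of its pseudo-label.

        Assumption: when the channel is full, the entry with the highest
        entropy is replaced, but only by a more confident sample.
        """
        label = max(range(2), key=lambda i: probs[i])
        h = self.entropy(probs)
        channel = self.channels[label]
        if len(channel) < self.capacity:
            channel.append((h, feature))
            return
        worst = max(range(len(channel)), key=lambda i: channel[i][0])
        if h < channel[worst][0]:
            channel[worst] = (h, feature)

    def predict(self, feature):
        """Return the class whose channel is most similar on average.

        An empty channel scores -inf; ties resolve to class 0.
        """
        scores = []
        for channel in self.channels:
            if not channel:
                scores.append(float("-inf"))
                continue
            sims = [self.cosine(feature, f) for _, f in channel]
            scores.append(sum(sims) / len(sims))
        return max(range(2), key=lambda i: scores[i])


# Toy usage with made-up 2-D features and classifier probabilities.
memory = DualChannelMemory(capacity=2)
memory.update([1.0, 0.0], [0.1, 0.9])  # confident "sarcastic" sample
memory.update([0.0, 1.0], [0.8, 0.2])  # confident "non-sarcastic" sample
print(memory.predict([0.9, 0.1]))  # prints 1
print(memory.predict([0.1, 0.9]))  # prints 0
```

Because the memory has no trainable parameters, changing the capacity or the replacement rule changes the classifier's behaviour without any retraining. Both the capacity and the replacement rule here are illustrative choices rather than details taken from the paper.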
