InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

Junjie Chen, Hang Yu, Weidong Liu, Subin Huang, Sanmin Liu
2024-08-13
Abstract: The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection methods have been proven to overestimate performance, as they struggle to effectively capture the intricate sarcastic cues that arise from the interaction between an image and text. To address these issues, we propose InterCLIP-MEP, a novel framework for multi-modal sarcasm detection. Specifically, we introduce an Interactive CLIP (InterCLIP) as the backbone to extract text-image representations, enhancing them by embedding cross-modality information directly within each encoder, thereby improving the representations to capture text-image interactions better. Furthermore, an efficient training strategy is designed to adapt InterCLIP for our proposed Memory-Enhanced Predictor (MEP). MEP uses a dynamic, fixed-length dual-channel memory to store historical knowledge of valuable test samples during inference. It then leverages this memory as a non-parametric classifier to derive the final prediction, offering a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, with an accuracy improvement of 1.08% and an F1 score improvement of 1.51% over the previous best method.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses multi-modal sarcasm detection in social media, where sarcasm is conveyed through combinations of text and images. According to the authors, existing multi-modal sarcasm detection methods have been shown to overestimate their performance, because they struggle to capture the sarcastic cues that arise from the interaction between an image and its text. To tackle this, the authors propose a framework named InterCLIP-MEP, which has two main components:

1. **Interactive CLIP (InterCLIP)**: The backbone that extracts text-image representations. It embeds information from each modality directly into the encoder of the other modality, so that the resulting representations capture text-image interactions better.
2. **Memory-Enhanced Predictor (MEP)**: A dynamic, fixed-length dual-channel memory that stores historical knowledge of valuable test samples during inference. The memory then acts as a non-parametric classifier that produces the final prediction, which makes the recognition of multi-modal sarcasm more robust.

The abstract also mentions an efficient training strategy that adapts InterCLIP for use with MEP. Experiments show that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, improving accuracy by 1.08% and F1 score by 1.51% over the previous best method. These results support the effectiveness of the proposed method.
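To make the idea of a fixed-length dual-channel memory used as a non-parametric classifier more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the text above does not say how "valuable" samples are selected or how the memory produces a prediction, so this sketch assumes that samples with low prediction entropy are kept, that the least confident entry is evicted when a channel is full, and that the final label is the channel with the larger summed cosine similarity. The names `DualChannelMemory`, `capacity`, `max_entropy`, `update`, and `predict` are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def binary_entropy(p):
    """Entropy (in nats) of a two-class prediction with P(sarcastic) = p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


@dataclass
class DualChannelMemory:
    capacity: int = 4         # fixed length of each channel (hypothetical value)
    max_entropy: float = 0.5  # assumed threshold for a "valuable" sample
    # channels[0] = non-sarcastic, channels[1] = sarcastic.
    # Each entry is a (entropy, feature) pair.
    channels: list = field(default_factory=lambda: [[], []])

    def update(self, feature, p_sarcastic):
        """Store a confident test sample in the channel of its pseudo-label."""
        h = binary_entropy(p_sarcastic)
        if h > self.max_entropy:
            return False  # too uncertain to keep
        channel = self.channels[int(p_sarcastic >= 0.5)]
        if len(channel) < self.capacity:
            channel.append((h, feature))
            return True
        # Channel is full: replace the least confident entry if this one is better.
        worst = max(range(len(channel)), key=lambda i: channel[i][0])
        if h < channel[worst][0]:
            channel[worst] = (h, feature)
            return True
        return False

    def predict(self, feature, p_sarcastic):
        """Return 1 (sarcastic) or 0 using summed similarity to each channel."""
        if not any(self.channels):
            return int(p_sarcastic >= 0.5)  # empty memory: fall back to the classifier
        scores = [sum(cosine(feature, f) for _, f in ch) for ch in self.channels]
        return int(scores[1] > scores[0])


# Usage with toy 2-D features and made-up classifier probabilities.
memory = DualChannelMemory(capacity=2)
memory.update([1.0, 0.0], p_sarcastic=0.95)  # confident sarcastic sample
memory.update([0.0, 1.0], p_sarcastic=0.05)  # confident non-sarcastic sample
print(memory.predict([0.9, 0.1], p_sarcastic=0.5))  # prints 1
```

In this sketch the memory never grows beyond `capacity` entries per channel, which is one way to read "dynamic, fixed-length": the contents change during inference while the size stays bounded. Summing similarities favors the fuller channel, so a mean or top-k similarity would be a reasonable alternative.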