Abstract: Despite commendable achievements made by existing work, prevailing multimodal sarcasm detection studies rely more on textual content over visual information. It unavoidably induces spurious correlations between textual words and labels, thereby significantly hindering the models' generalization capability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sarcasm detection, which aims to evaluate models' generalizability when the word distribution is different in training and testing settings. Moreover, we propose a novel debiasing multimodal sarcasm detection framework with contrastive learning, which aims to mitigate the harmful effect of biased textual factors for robust OOD generalization. In particular, we first design counterfactual data augmentation to construct the positive samples with dissimilar word biases and negative samples with similar word biases. Subsequently, we devise an adapted debiasing contrastive learning mechanism to empower the model to learn robust task-relevant features and alleviate the adverse effect of biased words. Extensive experiments show the superiority of the proposed framework.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the bias problem in Multimodal Sarcasm Detection (MSD). Existing multimodal sarcasm detection research relies more heavily on textual information than on visual information, so models perform poorly when the word distribution differs between training and testing data, i.e., in Out-of-Distribution (OOD) scenarios. This over-reliance on text exposes the model to unreliable textual cues (such as biased words) and leads to incorrect predictions.

### Background and motivation

With the rise of social media, people increasingly use sarcasm to voice their opinions online, so accurate sarcasm detection has become important for sentiment analysis and opinion mining. Early research focused on text-only methods, but as multimedia devices spread, people began to express emotions and opinions through multimodal content (text and images). Images often carry key cues for conveying sarcasm, which has made multimodal sarcasm detection an active research topic. However, existing multimodal sarcasm detection models still have the following problems:

1. **Over-reliance on textual information**: Existing models rely more on textual than on visual information, which makes them vulnerable to biased words in the text and hurts their generalization ability.
2. **Poor performance in OOD scenarios**: When the training and testing data distributions are inconsistent, model performance declines significantly because the models rely on spurious correlations in the training data.
### Solutions

To solve the above problems, the authors propose a new task, OOD multimodal sarcasm detection, and design a new debiasing multimodal sarcasm detection framework (DMSD-CL) that uses contrastive learning. The specific methods are as follows:

1. **Counterfactual data augmentation**: Construct negative samples that have similar biased words but the opposite label, and positive samples that have different biased words but the same label.
2. **Adapted debiasing contrastive learning**: Re-weight the contrastive learning loss so that the model better separates samples with similar biased words but different labels, and pulls together samples with different biased words but the same label.

### Experimental results

The authors conducted experiments on publicly available multimodal sarcasm detection benchmark datasets. The results show that the proposed DMSD-CL framework performs well on both the standard (IID) test set and the OOD test set. In OOD scenarios in particular, it is reported to perform significantly better than existing methods.

### Main contributions

1. **Defined a new OOD multimodal sarcasm detection task** to evaluate the true generalization ability of models in OOD scenarios.
2. **Proposed a debiasing multimodal sarcasm detection framework based on contrastive learning**, which improves generalization through counterfactual data augmentation and adapted debiasing contrastive learning.
3. **Constructed an OOD test set** and verified the effectiveness of the proposed method on it.

### Conclusion

This paper addresses the poor performance of existing models in OOD scenarios by introducing a new OOD multimodal sarcasm detection task and a debiasing contrastive learning framework, providing new ideas and methods for research on multimodal sarcasm detection.
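
The re-weighted contrastive loss described under Solutions can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact weighting scheme, so the function below is a generic InfoNCE-style loss in which each negative sample carries a weight. The names `cosine`, `weighted_contrastive_loss`, `neg_weights`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_contrastive_loss(anchor, positives, negatives,
                              neg_weights=None, temperature=0.1):
    """InfoNCE-style loss with per-negative weights.

    Assumption: the weighting form is illustrative only; the paper's
    actual re-weighting is not specified in this summary.
    """
    if neg_weights is None:
        neg_weights = [1.0] * len(negatives)
    # Weighted sum over negatives (e.g. same biased words, opposite label).
    neg_term = sum(w * math.exp(cosine(anchor, n) / temperature)
                   for w, n in zip(neg_weights, negatives))
    losses = []
    # One term per positive (e.g. different biased words, same label).
    for p in positives:
        pos_term = math.exp(cosine(anchor, p) / temperature)
        losses.append(-math.log(pos_term / (pos_term + neg_term)))
    return sum(losses) / len(losses)


# Toy 2-D features: the positive is close to the anchor, the negative is not.
anchor = [1.0, 0.0]
loss_aligned = weighted_contrastive_loss(anchor, [[0.9, 0.1]], [[0.0, 1.0]])
# Swapping the roles should produce a much larger loss.
loss_swapped = weighted_contrastive_loss(anchor, [[0.0, 1.0]], [[0.9, 0.1]])
# Up-weighting a negative increases its contribution to the loss.
loss_heavy = weighted_contrastive_loss(anchor, [[0.9, 0.1]], [[0.0, 1.0]],
                                       neg_weights=[2.0])
```

In this sketch, raising the weight of a negative makes the loss penalize its similarity to the anchor more strongly, which is one plausible way to emphasize negatives that share biased words with the anchor.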

Debiasing Multimodal Sarcasm Detection with Contrastive Learning

Fusion and Discrimination: A Multimodal Graph Contrastive Learning Framework for Multimodal Sarcasm Detection

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

TFCD: Towards Multi-modal Sarcasm Detection Via Training-Free Counterfactual Debiasing

Multi-View Incongruity Learning for Multimodal Sarcasm Detection

Multi-perspective Contrastive Learning Framework Guided by Sememe Knowledge and Label Information for Sarcasm Detection

Enhancing Cross-Lingual Sarcasm Detection by a Prompt Learning Framework with Data Augmentation and Contrastive Learning

A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Learning Multi-Task Commonness and Uniqueness for Multi-Modal Sarcasm Detection and Sentiment Analysis in Conversation

Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

MoBA: Mixture of Bi-directional Adapter for Multi-modal Sarcasm Detection

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Modeling Incongruity Between Modalities for Multimodal Sarcasm Detection

Multimodal Sarcasm Detection via Hybrid Classifier with Optimistic Logic

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency

Sarcasm detection in social media based on imbalanced classification