Abstract: Sarcasm is a form of sentiment expression that highlights the disparity between a person’s true intentions and the content they explicitly present. With the rapid growth of multimodal data on social platforms, detecting sarcasm across modalities has become a pivotal area of research. Although previous studies have extensively examined multimodal feature extraction, fusion, and the modeling of inter-modal incongruities, they have often neglected the subtle sentiment cues inherent in sarcastic multimodal data. They have also not adequately addressed the sparse distribution of sarcastic features and the tenuous connections between them, both within and across modalities. To address these gaps, we introduce a hierarchical fusion model that integrates sentiment information for enhanced multimodal sarcasm detection. Specifically, we apply attribute-object matching to the image modality and treat the result as an auxiliary attribute modality. Sentiment information is then extracted from each modality and combined to obtain a more comprehensive intra-modal representation. Moreover, we model inter-modal incongruity relationships using a cross-modal Transformer. We also introduce a sentiment-aware image-text contrastive loss to better align the semantics of images and text. By strengthening this alignment, our model is better equipped to understand incongruous relationships. Experiments demonstrate that our hierarchical fusion model achieves state-of-the-art performance on the multimodal sarcasm detection task.
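To make the idea of an image-text contrastive loss concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how sentiment enters the loss, so this sketch assumes a symmetric InfoNCE-style objective in which each image-text pair carries a hypothetical per-pair weight (`sentiment_weights`). The function name, the weighting scheme, and the temperature value are all illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    lse = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - lse for x in row]


def image_text_contrastive_loss(image_embs, text_embs,
                                sentiment_weights=None, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to come from the same post.
    sentiment_weights is a HYPOTHETICAL per-pair weight (for example, larger
    for pairs with strong sentiment cues); it is an assumption of this
    sketch and is not taken from the paper.
    """
    n = len(image_embs)
    if n == 0 or n != len(text_embs):
        raise ValueError("need equally sized, non-empty batches")
    if sentiment_weights is None:
        sentiment_weights = [1.0] * n
    weight_sum = sum(sentiment_weights)
    if weight_sum <= 0:
        raise ValueError("sentiment_weights must sum to a positive value")

    img = [_normalize(v) for v in image_embs]
    txt = [_normalize(v) for v in text_embs]

    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image view

    total = 0.0
    for k in range(n):
        image_to_text = -_log_softmax(sim[k])[k]
        text_to_image = -_log_softmax(sim_t[k])[k]
        total += sentiment_weights[k] * 0.5 * (image_to_text + text_to_image)
    return total / weight_sum


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(image_text_contrastive_loss(images, matched_texts))
    print(image_text_contrastive_loss(images, swapped_texts))
```

In this sketch, matched pairs are pulled together and every other in-batch pairing acts as a negative, so a batch whose texts are swapped relative to their images yields a much larger loss than a correctly paired batch.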

Towards Multimodal Sarcasm Detection Via Label-Aware Graph Contrastive Learning with Back-Translation Augmentation

Debiasing Multimodal Sarcasm Detection with Contrastive Learning

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

Learning Multi-Task Commonness and Uniqueness for Multi-Modal Sarcasm Detection and Sentiment Analysis in Conversation

Enhancing Cross-Lingual Sarcasm Detection by a Prompt Learning Framework with Data Augmentation and Contrastive Learning

Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism

Multi-View Incongruity Learning for Multimodal Sarcasm Detection

A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

Multimodal Sarcasm Detection via Hybrid Classifier with Optimistic Logic

An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

KnowleNet: Knowledge fusion network for multimodal sarcasm detection

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

Attention-based multi-modal fusion sarcasm detection

CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models

Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)

Multimodal Sarcasm Target Identification in Tweets

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network