Abstract: Sarcasm is a type of irony, characterized by an inherent mismatch between the literal interpretation and the intended connotation. Though sarcasm detection in text has been extensively studied, there are situations in which textual input alone might be insufficient to perceive sarcasm. The inclusion of additional contextual cues, such as images, is essential to recognize sarcasm in social media data effectively. This study presents a novel framework for multimodal sarcasm detection that can process input triplets. Two components of these triplets comprise the input text and its associated image, as provided in the datasets. Additionally, a supplementary modality is introduced in the form of descriptive image captions. The motivation behind incorporating this visual semantic representation is to more accurately capture the discrepancies between the textual and visual content, which are fundamental to the sarcasm detection task. The primary contributions of this study are: (1) a robust textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) an additional modality in the form of image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.

What problem does this paper attempt to address?

This paper addresses sarcasm detection in a multimodal setting. Traditional sarcasm detection methods rely mainly on text analysis, but text alone is often not enough to identify sarcasm accurately. On social media, for example, the combination of text and images provides richer context, which is crucial for understanding sarcasm. Existing multimodal sarcasm detection methods, however, often represent images through simple image attributes or noun-adjective pairs, which may not capture the deeper meaning of an image. These methods are especially limited when the image itself contains text. To address these problems, the paper proposes a new framework that aims to improve the accuracy of multimodal sarcasm detection by combining text, images and their descriptions (i.e., image captions). The main contributions of this framework include:

1. **Cross-lingual language model**: It extracts feature representations of the text and the image captions, and is especially suited to code-mixed text (such as English-Hindi mixing).
2. **Visual feature extraction branch**: A self-regulated residual ConvNet combined with a lightweight spatial attention module enhances the quality of the feature maps extracted from images.
3. **Image caption generation**: An encoder-decoder architecture generates detailed image descriptions, which serve as an additional input modality for capturing the inconsistency between text and images more accurately.
4. **Multi-level cross-modal semantic inconsistency representation**: Features at different levels are fused into a multi-level representation of the inconsistency between text and images, thereby improving the accuracy of sarcasm detection.

On two public datasets, the Twitter multimodal sarcasm dataset and the MultiBully dataset, the proposed model reaches accuracies of 92.89% and 64.48% respectively, the best among the baselines it is compared with. This suggests that introducing image captions as an additional input modality, combined with stronger feature extraction and attention mechanisms, can improve the accuracy and robustness of multimodal sarcasm detection.
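
The attention and fusion steps summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the feature matrices are random stand-ins for the outputs of the language model, the ConvNet and the caption encoder, and the shared feature size, the mean pooling, the plain concatenation and the function names `cross_attention` and `fuse` are all assumptions made for illustration. It only shows the general idea of letting the text attend separately to two image-side representations (image regions and caption tokens) and then fusing the results.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product attention: each query row attends over context rows.

    query:   (n_q, d) array, e.g. text token features
    context: (n_c, d) array, e.g. image region or caption token features
    returns: (n_q, d) array of context summaries, one per query row
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)  # (n_q, n_c) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ context


def fuse(text, image, caption):
    """Hypothetical fusion: pool each view and concatenate (an assumption)."""
    text_vs_image = cross_attention(text, image).mean(axis=0)
    text_vs_caption = cross_attention(text, caption).mean(axis=0)
    text_pooled = text.mean(axis=0)
    return np.concatenate([text_pooled, text_vs_image, text_vs_caption])


# Random stand-ins for encoder outputs; d = 8 is an arbitrary choice.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))      # 5 text tokens
image = rng.normal(size=(49, d))    # 7 x 7 grid of image regions
caption = rng.normal(size=(7, d))   # 7 caption tokens

fused = fuse(text, image, caption)
print(fused.shape)  # (24,)
```

In a trained model the fused vector would feed a classifier that predicts sarcastic versus non-sarcastic; the attention modules would also carry learned projection weights, which this sketch omits.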

Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features

MMSD-CAF: MultiModal Sarcasm Detection using CoAttention and Fusion Mechanisms

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

Multimodal Sarcasm Detection via Hybrid Classifier with Optimistic Logic

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

Detecting Sarcasm in Multimodal Social Platforms

FiLMing Multimodal Sarcasm Detection with Attention

A Survey of Multimodal Sarcasm Detection

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network

Attention-based multi-modal fusion sarcasm detection

Multi-modal sarcasm detection based on emotion perception and cross-modality attention fusion

Interpretable Multi-Head Self-Attention Architecture for Sarcasm Detection in Social Media

An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency

Mimicking the Brain's Cognition of Sarcasm From Multidisciplines for Twitter Sarcasm Detection

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

Interpretable Multi-Head Self-Attention model for Sarcasm Detection in social media

Multi-Modal Sarcasm Detection In Twitter With Hierarchical Fusion Model