Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Sajal Aggarwal, Ananya Pandey, Dinesh Kumar Vishwakarma
2024-08-06
Abstract: Sarcasm is a type of irony, characterized by an inherent mismatch between the literal interpretation and the intended connotation. Though sarcasm detection in text has been extensively studied, there are situations in which textual input alone might be insufficient to perceive sarcasm. The inclusion of additional contextual cues, such as images, is essential to recognize sarcasm in social media data effectively. This study presents a novel framework for multimodal sarcasm detection that can process input triplets. Two components of these triplets comprise the input text and its associated image, as provided in the datasets. Additionally, a supplementary modality is introduced in the form of descriptive image captions. The motivation behind incorporating this visual semantic representation is to more accurately capture the discrepancies between the textual and visual content, which are fundamental to the sarcasm detection task. The primary contributions of this study are: (1) a robust textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) an additional modality in the form of image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses sarcasm detection in a multimodal setting. Traditional sarcasm detection methods rely mainly on text analysis, but text alone is often insufficient to identify sarcasm accurately. On social media, for example, the combination of text and images provides richer context, which is crucial for understanding sarcasm. Existing multimodal sarcasm detection methods, however, often reduce the image to simple attributes or noun-adjective pairs, which may not capture its deeper meaning. These methods are especially limited when the image itself contains text.

To address these problems, the paper proposes a new framework that aims to improve the accuracy of multimodal sarcasm detection by combining the text, the image, and a description of the image (an image caption). The main contributions of the framework are:

1. **Cross-lingual language model**: Extracts feature representations of the text and the image captions. It is especially suited to code-mixed text (for example, mixed English-Hindi).
2. **Visual feature extraction branch**: A self-regulated residual ConvNet combined with a lightweight spatial attention module improves the quality of the feature maps extracted from images.
3. **Image caption generation**: An encoder-decoder architecture generates detailed image descriptions, which serve as an additional input modality for capturing the inconsistency between text and image more accurately.
4. **Multi-level cross-modal semantic incongruity representation**: Fusing features from different levels yields a multi-level representation of the inconsistency between text and image, which improves the accuracy of sarcasm detection.
With these contributions, the proposed model reaches accuracies of 92.89% on the Twitter multimodal sarcasm dataset and 64.48% on the MultiBully dataset, the best results among the baselines it is compared with. This suggests that introducing image captions as an additional input modality, together with the feature extraction and attention mechanisms described above, can improve the accuracy of multimodal sarcasm detection.
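
The idea of attending from the text to two levels of image representation (the image features and the caption features) and then fusing the results can be illustrated with a small sketch. This is not the authors' implementation: the scaled dot-product attention, the mean pooling, the fusion by concatenation, and every function name below are assumptions chosen for illustration, and random vectors stand in for the outputs of the real text, image, and caption encoders.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Each query row attends over all rows of keys_values
    (scaled dot-product attention; an assumed stand-in for the
    paper's attention modules)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(rows: Matrix) -> List[float]:
    """Average the rows into a single vector."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def fuse(text: Matrix, image: Matrix, caption: Matrix) -> List[float]:
    """Text attends to image regions and to caption tokens separately;
    the two pooled results are concatenated (hypothetical fusion step)."""
    text_to_image = mean_pool(cross_attention(text, image))
    text_to_caption = mean_pool(cross_attention(text, caption))
    return text_to_image + text_to_caption


def random_features(rows: int, dim: int, rng: random.Random) -> Matrix:
    """Placeholder for encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
text = random_features(5, 8, rng)      # 5 text tokens
image = random_features(4, 8, rng)     # 4 image regions
caption = random_features(6, 8, rng)   # 6 caption tokens

fused = fuse(text, image, caption)
print(len(fused))  # 16: two pooled 8-dimensional vectors, concatenated
```

In a full model, the fused vector would be passed to a classifier that predicts whether the input is sarcastic. Keeping the text-to-image and text-to-caption attention paths separate mirrors the "two levels of image representations" mentioned in the abstract, but how the paper actually combines them is not specified here.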