A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Weiyu Zhong, Zhengxuan Zhang, Qiaofeng Wu, Yun Xue, Qianhua Cai
DOI: https://doi.org/10.3390/math12020317
IF: 2.4
2024-01-19
Mathematics
Abstract: Sarcasm represents a language form where a discrepancy lies between the literal meanings and implied intention. Sarcasm detection is challenging with unimodal text without clearly understanding the context, based on which multimodal information is introduced to benefit detection. However, current approaches only focus on modeling text–image incongruity at the token level and use the incongruity as the key to detection, ignoring the significance of the overall multimodal features and textual semantics during processing. Moreover, semantic information from other samples with a similar manner of expression also facilitates sarcasm detection. In this work, a semantic enhancement framework is proposed to address image–text congruity by modeling textual and visual information at the multi-scale and multi-span token level. The efficacy of textual semantics in multimodal sarcasm detection is pronounced. Aiming to bridge the cross-modal semantic gap, semantic enhancement is performed by using a multiple contrastive learning strategy. Experiments were conducted on a benchmark dataset. Our model outperforms the latest baseline by 1.87% in terms of the F1-score and 1% in terms of accuracy.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses two key challenges in multimodal sarcasm detection:

1. **Insufficient Utilization of Multimodal Information**: Current methods mainly model the incongruity between text and images at the token level and treat it as the key clue for sarcasm recognition, neglecting the importance of overall multimodal features and textual semantics.
2. **Cross-Modal Semantic Gap**: The significant semantic gap between images and text makes it harder to judge whether an image and its text are congruent.

To tackle these challenges, the authors propose a Semantic Enhancement Framework (SEF) that improves multimodal sarcasm detection through the following methods:

- **Multi-Scale and Multi-Span Text and Visual Information Modeling**: Modeling text and visual information at different scales and spans to capture more comprehensive semantic information.
- **Contrastive Learning Strategy**: Optimizing multimodal representations through multiple contrastive learning objectives to reduce the semantic gap between the visual and text modalities.
- **Semantic Information Enhancement**: Enhancing semantic information using other samples within the same batch that share a similar manner of expression.

Experimental results show that SEF outperforms the latest baseline on a benchmark dataset, with improvements of 1.87% in F1-score and 1% in accuracy.