A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Weiyu Zhong, Zhengxuan Zhang, Qiaofeng Wu, Yun Xue, Qianhua Cai

DOI: https://doi.org/10.3390/math12020317

IF: 2.4

2024-01-19

Mathematics

Abstract: Sarcasm represents a language form where a discrepancy lies between the literal meanings and implied intention. Sarcasm detection is challenging with unimodal text without clearly understanding the context, based on which multimodal information is introduced to benefit detection. However, current approaches only focus on modeling text–image incongruity at the token level and use the incongruity as the key to detection, ignoring the significance of the overall multimodal features and textual semantics during processing. Moreover, semantic information from other samples with a similar manner of expression also facilitates sarcasm detection. In this work, a semantic enhancement framework is proposed to address image–text congruity by modeling textual and visual information at the multi-scale and multi-span token level. The efficacy of textual semantics in multimodal sarcasm detection is pronounced. Aiming to bridge the cross-modal semantic gap, semantic enhancement is performed by using a multiple contrastive learning strategy. Experiments were conducted on a benchmark dataset. Our model outperforms the latest baseline by 1.87% in terms of the F1-score and 1% in terms of accuracy.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses two key challenges in multimodal sarcasm detection:

1. **Insufficient Utilization of Multimodal Information**: Current methods focus mainly on text–image incongruity at the token level and treat it as the key clue for sarcasm recognition, neglecting the overall multimodal features and the textual semantics.
2. **Cross-Modal Semantic Gap**: The significant semantic gap between images and text makes text–image congruity harder to recognize.

To tackle these challenges, the authors propose a Semantic Enhancement Framework (SEF) that improves multimodal sarcasm detection through the following methods:

- **Multi-Scale and Multi-Span Text and Visual Information Modeling**: Modeling textual and visual information at different scales and spans to capture more comprehensive semantic information.
- **Contrastive Learning Strategy**: Optimizing multimodal representations through contrastive learning to reduce the semantic gap between the visual and textual modalities.
- **Semantic Information Enhancement**: Enhancing semantic information using other samples within the same batch to improve the model's performance.

Experimental results show that SEF outperforms the latest baseline on a benchmark dataset, with improvements of 1.87% in F1-score and 1% in accuracy.
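As a rough illustration of how contrastive learning over in-batch samples can pull matching text and image representations together, the sketch below implements a symmetric InfoNCE-style loss in plain Python. It is an assumption made for illustration only: the summary does not state the paper's loss formulation, so the function names (`info_nce`, `symmetric_contrastive_loss`), the temperature of 0.07, and the toy two-dimensional embeddings are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i].

    Every other candidate in the batch acts as a negative sample.
    """
    anchors = [l2_normalize(v) for v in anchors]
    candidates = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Average of the text-to-image and image-to-text InfoNCE terms."""
    return 0.5 * (
        info_nce(text_emb, image_emb, temperature)
        + info_nce(image_emb, text_emb, temperature)
    )


# Toy batch of two text-image pairs (hypothetical 2-d embeddings).
text = [[1.0, 0.0], [0.0, 1.0]]
image_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each image close to its own text
image_swapped = [[0.1, 0.9], [0.9, 0.1]]   # each image close to the other text

loss_aligned = symmetric_contrastive_loss(text, image_aligned)
loss_swapped = symmetric_contrastive_loss(text, image_swapped)
```

In this toy batch, `loss_aligned` is smaller than `loss_swapped`, which is the behaviour a contrastive objective is meant to reward: paired text and image embeddings score higher than mismatched ones drawn from the same batch.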

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism

Learning Multi-Task Commonness and Uniqueness for Multi-Modal Sarcasm Detection and Sentiment Analysis in Conversation

CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models

Detecting Sarcasm in Multimodal Social Platforms

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

A Survey of Multimodal Sarcasm Detection

KnowleNet: Knowledge fusion network for multimodal sarcasm detection

Multi-View Incongruity Learning for Multimodal Sarcasm Detection

Multi-modal sarcasm detection based on emotion perception and cross-modality attention fusion

Multi-Modal Sarcasm Detection In Twitter With Hierarchical Fusion Model

Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation

Sememe knowledge and auxiliary information enhanced approach for sarcasm detection