Abstract: Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for the given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook exploiting the sentiments residing in the utterance, video and audio, which are vital clues for sarcasm explanation. In fact, it is non-trivial to incorporate sentiments for boosting SED performance, due to three main challenges: 1) diverse effects of utterance tokens on sentiments; 2) gap between video-audio sentiment signals and the embedding space of BART; and 3) various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, to facilitate sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.

What problem does this paper attempt to address?

This paper addresses Sarcasm Explanation in Dialogue (SED): generating a natural-language explanation that reveals the sarcastic meaning of a given dialogue involving multiple modalities (i.e., utterance, video, and audio). Existing research has achieved remarkable success with the pre-trained language model BART, but it overlooks the sentiment information present in the utterances, video, and audio, which is a crucial clue for sarcasm explanation.

### Main Challenges

The paper identifies three main challenges in using sentiment information to improve SED performance:

1. **Diverse effects of utterance tokens on sentiments**: Utterances contain several types of words, such as transitional words (e.g., "but"), negation words (e.g., "not"), intensity words (e.g., "very"), and sentiment words (e.g., "happy"), and each type contributes differently to the sentiment of the utterance. Analyzing these diverse effects is the first challenge.
2. **Gap between video-audio sentiment signals and the embedding space of BART**: The sentiment signals conveyed by the video and audio modalities, such as facial expressions and intonation, do not match the semantic space of BART, because BART is pre-trained purely on text corpora. Injecting this sentiment information into BART effectively is the second challenge.
3. **Various semantic relations among utterances, utterance sentiments, and video-audio sentiments**: Rich semantic relations exist among utterances, utterance sentiments, and video-audio sentiments (e.g., semantic associations between words in the utterances, and inconsistencies between an utterance's sentiment and the corresponding video-audio sentiment), and these relations are important for sarcasm explanation. Modeling them to support understanding of the dialogue context, and thus improve the quality of the generated explanations, is the third challenge.

### Solutions

To address these challenges, the paper proposes a sentiment-enhanced graph-based multimodal sarcasm explanation framework named EDGE. The framework consists of four components:

1. **Lexicon-guided utterance sentiment inference**: Analyze the influence of different words on utterance sentiment using the BabelSenticNet lexicon, and refine the utterance sentiment with a heuristic strategy.
2. **Video-audio joint sentiment inference**: Infer the joint sentiment label of each video-audio clip by extending the multimodal sentiment analysis model JCA.
3. **Sentiment-enhanced context encoding**: Construct a context-sentiment graph to comprehensively model the semantic relations among utterances, utterance sentiments, and video-audio sentiments.
4. **Sarcasm explanation generation**: Use the BART decoder to generate the sarcasm explanation for the dialogue.

### Experimental Results

The paper reports extensive experiments on the publicly released WITS dataset, and the results show that EDGE outperforms existing methods.

### Contributions

1. Proposed a sentiment-enhanced graph-based multimodal sarcasm explanation framework (EDGE), which incorporates utterance sentiment and video-audio sentiment to improve the understanding of sarcasm semantics.
2. Proposed a heuristic utterance sentiment refinement strategy that analyzes the influence of different words on utterance sentiment.
3. Constructed a context-sentiment graph that comprehensively captures the semantic relations among utterances, utterance sentiments, and video-audio sentiments.
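
To make the idea of lexicon-guided utterance sentiment inference more concrete, the following is a minimal Python sketch of how transitional, negation, intensity, and sentiment words might be combined heuristically. It is not the paper's actual refinement strategy, whose rules are not given in this summary. The lexicon `TOY_LEXICON`, the word sets, the weights, the `threshold` parameter, and the function `utterance_sentiment` are all hypothetical stand-ins chosen for illustration; in particular, the scores are invented and are not BabelSenticNet values.

```python
# Hypothetical toy lexicon: the polarity scores are invented for illustration
# and are NOT taken from BabelSenticNet.
TOY_LEXICON = {
    "happy": 0.8, "great": 0.7, "love": 0.9,
    "sad": -0.7, "terrible": -0.9, "boring": -0.6,
}
NEGATIONS = {"not", "never", "no"}                      # flip the next sentiment word
INTENSIFIERS = {"very": 1.5, "so": 1.3, "really": 1.4}  # scale the next sentiment word
TRANSITIONS = {"but", "however", "yet"}                 # later clause dominates


def utterance_sentiment(utterance: str, threshold: float = 0.1) -> str:
    """Return 'positive', 'negative', or 'neutral' for one utterance.

    Assumed heuristics (illustrative only):
      1. Only the clause after the last transitional word is scored.
      2. A negation word flips the polarity of the next sentiment word.
      3. An intensity word scales the polarity of the next sentiment word.
    """
    tokens = [t.strip(".,!?;:\"'").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]

    # Heuristic 1: keep only the tokens after the last transitional word.
    last_transition = max(
        (i for i, t in enumerate(tokens) if t in TRANSITIONS), default=-1
    )
    clause = tokens[last_transition + 1:] or tokens

    score, negate, boost = 0.0, False, 1.0
    for tok in clause:
        if tok in NEGATIONS:
            negate = True                      # Heuristic 2
        elif tok in INTENSIFIERS:
            boost *= INTENSIFIERS[tok]         # Heuristic 3
        elif tok in TOY_LEXICON:
            polarity = TOY_LEXICON[tok] * boost
            score += -polarity if negate else polarity
            negate, boost = False, 1.0         # modifiers apply to one word only

    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in [
        "I am very happy",
        "I am not happy",
        "The food was great but the service was terrible",
    ]:
        print(text, "->", utterance_sentiment(text))
```

In a pipeline like the one summarized above, a label of this kind could then be compared with the video-audio sentiment label of the same clip, since a mismatch between the two is one of the relations the summary highlights as informative for sarcasm.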

Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue

Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation

When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network

A Multi-Level Embedding Framework for Decoding Sarcasm Using Context, Emotion, and Sentiment Feature

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

A Semantic Enhancement Framework for Multimodal Sarcasm Detection

An Effective Sarcasm Detection Approach Based on Sentimental Context and Individual Expression Habits

Learning Multi-Task Commonness and Uniqueness for Multi-Modal Sarcasm Detection and Sentiment Analysis in Conversation

Describe Images in a Boring Way: Towards Cross-Modal Sarcasm Generation.

Nice Perfume. How Long Did You Marinate in It? Multimodal Sarcasm Explanation

Sememe knowledge and auxiliary information enhanced approach for sarcasm detection

A Dual-Channel Framework for Sarcasm Recognition by Detecting Sentiment Conflict

Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Multi-modal sarcasm detection based on emotion perception and cross-modality attention fusion

VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features

Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection

A smart video analytical framework for sarcasm detection using novel adaptive fusion network and SarcasNet-99 model

How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation