Abstract: Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for the given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook exploiting the sentiments residing in the utterance, video and audio, which are vital clues for sarcasm explanation. In fact, it is non-trivial to incorporate sentiments for boosting SED performance, due to three main challenges: 1) diverse effects of utterance tokens on sentiments; 2) gap between video-audio sentiment signals and the embedding space of BART; and 3) various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, to facilitate sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.
What problem does this paper attempt to address?
This paper addresses Sarcasm Explanation in Dialogue (SED): generating a natural-language explanation that reveals the sarcastic meaning of a given dialogue spanning multiple modalities (utterance, video, and audio). Existing approaches built on the pretrained language model BART perform well, but they ignore the sentiment carried by the utterances, video, and audio, which is a vital clue for explaining sarcasm.
### Main Challenges
The paper identifies three main challenges in incorporating sentiment to improve SED performance:
1. **Diverse effects of utterance tokens on sentiments**: Utterances contain several types of words, such as transitional words (e.g., "but"), negation words (e.g., "not"), intensity words (e.g., "very"), and sentiment words (e.g., "happy"), and each type contributes differently to the sentiment of the utterance. Analyzing these diverse effects is therefore an important challenge.
2. **Gap between video-audio sentiment signals and the embedding space of BART**: The emotional signals transmitted by video and audio modalities, such as facial expressions and intonation, do not match the semantic space of BART because BART is pre-trained purely on text corpora. Therefore, how to effectively inject emotional information into BART is an important challenge.
3. **Various semantic relations among utterances, utterance sentiments, and video-audio sentiments**: There are rich semantic relations among utterances, utterance sentiments, and video-audio sentiments (e.g., semantic associations of words in utterances and inconsistencies between utterance sentiments and corresponding video-audio sentiments), and these relations are very important for sarcasm explanation. How to model these relations to help understand the dialogue context and thus improve the quality of sarcasm explanation generation is also a key challenge.
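To make the first challenge concrete, the following is a minimal sketch of how the four word types can affect an utterance's sentiment differently. It is not the paper's refinement strategy: the tiny lexicon, its scores, the helper names `utterance_sentiment` and `sentiment_label`, and the three rules (keep only the clause after the last transitional word, flip polarity on negation, scale on intensity words) are all assumptions chosen for illustration.

```python
import re

# Toy lexicon. All entries and scores are illustrative assumptions,
# not values taken from BabelSenticNet or from the paper.
POLARITY = {"happy": 0.8, "great": 0.7, "sad": -0.7, "terrible": -0.9}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "really": 1.3}
TRANSITIONS = {"but", "however"}


def utterance_sentiment(utterance: str) -> float:
    """Score an utterance with three hypothetical rules:
    transition words discard the preceding clause, negators flip the
    sign of the next sentiment word, intensifiers scale it."""
    tokens = re.findall(r"[a-z']+", utterance.lower())

    # Rule 1: keep only the tokens after the last transitional word.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in TRANSITIONS:
            tokens = tokens[i + 1:]
            break

    score, sign, scale = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in NEGATORS:
            sign = -sign                      # Rule 2: negation flips polarity
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]        # Rule 3: intensity scales polarity
        elif tok in POLARITY:
            score += sign * scale * POLARITY[tok]
            sign, scale = 1.0, 1.0            # modifiers apply to one word only
    return score


def sentiment_label(score: float) -> str:
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["I am not very happy",
                 "The food was great, but the service was terrible."]:
        print(text, "->", sentiment_label(utterance_sentiment(text)))
```

Under these assumed rules, both example utterances come out negative even though each contains a positive word, which is the kind of token-level interaction a plain average of word polarities would miss.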
### Solutions
To address the above challenges, the paper proposes a novel sentiment-enhanced graph-based multimodal sarcasm explanation framework, named EDGE. The EDGE framework consists of four components:
1. **Lexicon-guided utterance sentiment inference**: Analyze the influence of different words on utterance sentiment using the BabelSenticNet lexicon, and adopt a heuristic strategy to refine the utterance sentiment.
2. **Video-audio joint sentiment inference**: Infer the joint sentiment label of each video-audio segment by extending the multimodal sentiment analysis model JCA.
3. **Sentiment-enhanced context encoding**: Construct a context-sentiment graph to comprehensively model the semantic relations among utterances, utterance sentiments, and video-audio sentiments.
4. **Sarcasm explanation generation**: Use the BART decoder to generate sarcasm explanations in the dialogue.
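The third component can be pictured with a small sketch of a context-sentiment graph. The summary does not specify the node or edge definitions, so everything here is an assumption made for illustration: the function name `build_context_sentiment_graph`, the node tuples, and the three edge types (adjacent tokens within an utterance, each token to its utterance's sentiment node, and each utterance sentiment node to the matching video-audio sentiment node). It only builds an adjacency matrix; encoding the nodes with BART and running a graph network over them is out of scope.

```python
from typing import List, Tuple

Node = Tuple[str, int, str]  # (node type, utterance index, text or label)


def build_context_sentiment_graph(
    utterances: List[List[str]],
    utt_sentiments: List[str],
    va_sentiments: List[str],
) -> Tuple[List[Node], List[List[int]]]:
    """Build nodes and a symmetric adjacency matrix (with self-loops).

    Hypothetical edge types, assumed for this sketch:
      * token <-> next token in the same utterance
      * token <-> sentiment node of its utterance
      * utterance sentiment <-> video-audio sentiment of that utterance
    """
    nodes: List[Node] = []
    edges = set()

    for u_idx, tokens in enumerate(utterances):
        start = len(nodes)
        nodes.extend(("token", u_idx, tok) for tok in tokens)
        token_ids = list(range(start, len(nodes)))

        # Link neighbouring tokens inside the utterance.
        for a, b in zip(token_ids, token_ids[1:]):
            edges.add((a, b))

        utt_id = len(nodes)
        nodes.append(("utt_sent", u_idx, utt_sentiments[u_idx]))
        va_id = len(nodes)
        nodes.append(("va_sent", u_idx, va_sentiments[u_idx]))

        # Link every token to its utterance-level sentiment.
        for t in token_ids:
            edges.add((t, utt_id))
        # Link the textual sentiment to the video-audio sentiment so that
        # a mismatch between the two is visible to a graph encoder.
        edges.add((utt_id, va_id))

    n = len(nodes)
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1
    return nodes, adj


if __name__ == "__main__":
    nodes, adj = build_context_sentiment_graph(
        utterances=[["nice", "job"], ["thanks"]],
        utt_sentiments=["positive", "positive"],
        va_sentiments=["negative", "neutral"],
    )
    for node, row in zip(nodes, adj):
        print(node, row)
```

In the example, the first utterance has a positive textual sentiment node linked to a negative video-audio sentiment node, the sort of inconsistency the summary cites as a cue for sarcasm.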
### Experimental Results
The paper reports extensive experiments on the publicly released WITS dataset, in which the proposed EDGE framework outperforms cutting-edge baseline methods.
### Contributions
1. Proposed a novel sentiment-enhanced graph-based multimodal sarcasm explanation framework (EDGE), which incorporates utterance sentiment and video-audio sentiment to improve the understanding of sarcasm semantics.
2. Proposed a heuristic utterance sentiment refinement strategy that analyzes the influence of different words on utterance sentiment.
3. Constructed a context-sentiment graph that comprehensively captures the semantic relations among utterances, utterance sentiments, and video-audio sentiments.