Abstract: Sarcasm, sentiment and emotion are tightly coupled, in that recognizing one helps in understanding the others, which has made their joint recognition in conversation a focus of research in artificial intelligence (AI) and affective computing. Three main challenges exist: context dependency, multimodal fusion and multitask interaction. However, most existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to address the three problems generically with a multimodal joint framework. We propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms: intramodal (Ia) attention and intermodal (Ie) attention. Ia attention is designed to capture the contextual dependency between adjacent utterances, while Ie attention is designed to model multimodal interactions. On the decoder side, we design two kinds of multitask learning (MTL) decoders, single-level and multilevel, to explore their potential. More specifically, the core of the single-level decoder is a masked outer-modal (Or) self-attention mechanism, whose main motivation is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder consists of shared gating and task-specific gating networks. Comprehensive experiments on four benchmark datasets, MUStARD, Memotion, CMU-MOSEI and MELD, demonstrate the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL), with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.
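
The abstract does not give the exact formulation of the three attention mechanisms, so the following is only a minimal sketch of the general idea, assuming standard scaled dot-product attention. The feature sizes, the random inputs, the `attention` helper, the use of text and audio as the two modalities, and the lower-triangular mask over an assumed task order (sarcasm, sentiment, emotion) are all illustrative assumptions, not the authors' implementation. Note also that the sketch attends over all utterances, whereas the abstract speaks of adjacent utterances.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention (hypothetical helper).

    mask is a boolean array; True marks positions a query may attend to.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_utt, d = 4, 8                        # assumed: 4 utterances, 8-dim features
text = rng.normal(size=(n_utt, d))     # stand-in textual features
audio = rng.normal(size=(n_utt, d))    # stand-in acoustic features

# Intramodal-style attention: each utterance attends to the other
# utterances within the same modality to pick up conversational context.
ctx_text, w_intra = attention(text, text, text)

# Intermodal-style attention: text queries attend over audio keys/values,
# giving a text representation informed by the other modality.
fused, w_inter = attention(ctx_text, audio, audio)

# Masked self-attention over task slots (assumed order: sarcasm,
# sentiment, emotion). The lower-triangular mask lets each task attend
# only to itself and to earlier tasks.
tasks = rng.normal(size=(3, d))
task_mask = np.tril(np.ones((3, 3), dtype=bool))
task_repr, w_task = attention(tasks, tasks, tasks, mask=task_mask)
```

In this sketch each row of `w_intra`, `w_inter` and `w_task` is a probability distribution over the attended positions, and the entries of `w_task` above the diagonal are zero because of the mask.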

Learning Multi-Task Commonness and Uniqueness for Multi-Modal Sarcasm Detection and Sentiment Analysis in Conversation

Multi-Modal Sarcasm Detection with Sentiment Word Embedding

A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations

An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency

Detect Sarcasm and Humor Jointly by Neural Multi-Task Learning

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

Multi-modal sarcasm detection based on emotion perception and cross-modality attention fusion

Sarcasm driven by sentiment: A sentiment-aware hierarchical fusion network for multimodal sarcasm detection

Multi-View Incongruity Learning for Multimodal Sarcasm Detection

Multi-Modal Sarcasm Detection In Twitter With Hierarchical Fusion Model

Attention-based multi-modal fusion sarcasm detection

Sentiment Analysis and Sarcasm Detection using Deep Multi-Task Learning

Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism

A Semantic Enhancement Framework for Multimodal Sarcasm Detection

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

Self-Adaptive Representation Learning Model for Multi-Modal Sentiment and Sarcasm Joint Analysis

MMSD-CAF: MultiModal Sarcasm Detection using CoAttention and Fusion Mechanisms

Enhanced Semantic Representation Learning for Sarcasm Detection by Integrating Context-Aware Attention and Fusion Network

Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection

KnowleNet: Knowledge fusion network for multimodal sarcasm detection