Abstract: Sarcasm, sentiment and emotion are tightly coupled: understanding one helps in understanding the others. This makes their joint recognition in conversation a focus of research in artificial intelligence (AI) and affective computing. Three main challenges exist: context dependency, multimodal fusion and multitask interaction. However, most existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to address the three challenges generically with a multimodal joint framework. We propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms: intramodal (Ia) attention and intermodal (Ie) attention. Ia attention is designed to capture the contextual dependency between adjacent utterances, while Ie attention is designed to model multimodal interactions. On the decoder side, we design two kinds of multitask learning (MTL) decoders, single-level and multilevel, to explore their potential. The core of the single-level decoder is a masked outer-modal (Or) self-attention mechanism, whose main motivation is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder consists of shared gating and task-specific gating networks. Comprehensive experiments on four benchmark datasets, MUStARD, Memotion, CMU-MOSEI and MELD, demonstrate the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL), with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1.
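
The distinction between the two encoder attentions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention without learned projections, and the function names (`softmax`, `attention`) and the toy text and audio features are hypothetical. Intramodal-style attention lets the utterances of one modality attend to each other, while intermodal-style attention lets one modality query another.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy features: 3 utterances, 2 dimensions per modality.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Intramodal-style: each text utterance attends to its text context.
intra_text = attention(text, text, text)

# Intermodal-style: each text utterance attends to the audio features.
inter_text = attention(text, audio, audio)
```

In both calls the output has one vector per utterance; only the source of the keys and values changes. A full model would add learned projections, masking and further modalities, none of which are specified in the abstract.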

A Bi-directional Multi-hop Inference Model for Joint Dialog Sentiment Classification and Act Recognition

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

DARER: Dual-task Temporal Relational Recurrent Reasoning Network for Joint Dialog Sentiment Classification and Act Recognition

A study of sentiment classification methods with dual model decision fusion

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Multimodal interaction enhanced representation learning for video emotion recognition

Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition

Dual Causes Generation Assisted Model for Multimodal Aspect-Based Sentiment Classification

Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Cross-Modal Sentiment Sensing with Visual-Augmented Representation and Diverse Decision Fusion

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

SMIN: Semi-supervised Multi-modal Interaction Network for Conversational Emotion Recognition

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

A Multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations

Dual-View Multimodal Interaction in Multimodal Sentiment Analysis

Multi-modal Sentiment and Emotion Joint Analysis with a Deep Attentive Multi-task Learning Model

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism