Abstract: Emotion Recognition in Conversation (ERC) plays an important role in driving the development of human-machine interaction. Emotions can be expressed in multiple modalities, and multimodal ERC mainly faces two problems: (1) noise introduced during cross-modal information fusion, and (2) the difficulty of predicting emotion labels that have few samples and are semantically similar yet belong to different categories. To address these issues and fully utilize the features of each modality, we adopt the following strategies. First, deep emotion cue extraction is performed on modalities with strong representation ability, and feature filters are designed to provide multimodal prompt information for modalities with weak representation ability. Then, we design a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, allowing prompt information to participate in encoding textual features and to be fused with multi-level textual information, which yields better multimodal fusion features. Finally, we use a Hybrid Contrastive Learning (HCL) strategy to improve the model's ability to handle labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of multimodal fusion and supervised contrastive learning to mine the information of labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art models in ERC on two benchmark datasets.

What problem does this paper attempt to address?

The paper attempts to address two main challenges in Emotion Recognition in Conversation (ERC):

1. **Noise in the process of cross-modal information fusion**: When integrating information from different modalities (such as text, audio, and visual), the semantic gap between these modalities can introduce a significant amount of noise if ignored, thereby affecting the final emotion prediction performance.
2. **Prediction of emotion labels with few samples**: In ERC datasets, there are many emotion labels that are semantically similar but have few samples (e.g., fear and disgust, happiness and excitement). The small number of samples for these labels makes it difficult for the model to predict these emotion categories accurately, thus affecting the overall prediction performance.

To address these issues, the authors propose a new model called Multimodal Prompt Transformer with Hybrid Contrastive Learning (MPT-HCL), which aims to fully utilize the features of each modality, reduce noise generation, and improve the prediction capability for emotion labels with few samples. The specific strategies include:

- **Deep emotion cue extraction**: Extract deep emotion cues from modalities with strong representation capabilities and design feature filters as multimodal prompt information for modalities with weak representation capabilities.
- **Multimodal Prompt Transformer (MPT)**: Embed multimodal fusion information into each attention layer of the Transformer, allowing prompt information to participate in text feature encoding and fuse with multi-level text information to obtain better multimodal fusion features.
- **Hybrid Contrastive Learning (HCL)**: Use unsupervised contrastive learning to enhance the representation capability of multimodal fusion and use supervised contrastive learning to mine information from labels with few samples, optimizing the model's ability to handle labels with few samples.

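
To make the Multimodal Prompt Transformer idea above more concrete, here is a minimal sketch of one way prompt vectors could take part in text encoding at every attention layer. It is not the authors' implementation: the summary does not give the layer equations, so this sketch assumes a prefix-style design in which prompt vectors are prepended to the keys and values of a single-head attention with no learned projections. The names `prompt_attention` and `encode` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_attention(text, prompts):
    """Single-head attention where each text token attends over
    prompt vectors plus text tokens (assumed prefix-style fusion).

    text:    list of n vectors of dimension d (textual features)
    prompts: list of p vectors of dimension d (multimodal prompts)
    Returns (fused vectors, attention weights), one row per text token.
    """
    d = len(text[0])
    memory = prompts + text  # keys and values: prompts first, then text
    fused, weights = [], []
    for query in text:
        scores = [dot(query, key) / math.sqrt(d) for key in memory]
        w = softmax(scores)
        fused.append([
            sum(w[j] * memory[j][i] for j in range(len(memory)))
            for i in range(d)
        ])
        weights.append(w)
    return fused, weights


def encode(text, prompts_per_layer):
    """Stack layers so that prompts are injected at every layer.
    A plain residual connection is an assumption of this sketch."""
    hidden = text
    for prompts in prompts_per_layer:
        fused, _ = prompt_attention(hidden, prompts)
        hidden = [[h + f for h, f in zip(hv, fv)]
                  for hv, fv in zip(hidden, fused)]
    return hidden


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    prompts = [[0.5, 0.5]]
    print(encode(text, [prompts, prompts]))
```

Because the prompts sit in the key and value set of every layer, each level of the text representation can draw on the audio and visual information, which is the property the summary attributes to MPT. A real model would add learned query, key, and value projections, multiple heads, and normalization.
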
Experimental results show that the proposed model outperforms existing state-of-the-art models on two benchmark datasets (IEMOCAP and MELD).
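
The hybrid contrastive learning strategy described above can be illustrated with a small sketch. The summary does not state the exact loss functions, so this example assumes two commonly used forms: an InfoNCE-style loss between two views of the same utterances for the unsupervised part, and a supervised contrastive loss that pulls together samples sharing a label. The weighting parameter `alpha`, the `temperature` value, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def unsupervised_contrastive_loss(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss: sample i in view_a should match sample i
    in view_b better than any other sample in view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, bj) / temperature for bj in b]
        total -= logits[i] - log_sum_exp(logits)
    return total / len(a)


def supervised_contrastive_loss(features, labels, temperature=0.5):
    """Supervised contrastive loss: samples with the same label are
    positives, all other samples in the batch are negatives."""
    z = [normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: dot(z[i], z[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = log_sum_exp(list(logits.values()))
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hybrid_contrastive_loss(view_a, view_b, labels, alpha=0.5, temperature=0.5):
    """Weighted sum of both terms; alpha is an assumed hyperparameter."""
    unsup = unsupervised_contrastive_loss(view_a, view_b, temperature)
    sup = supervised_contrastive_loss(view_a, labels, temperature)
    return alpha * unsup + (1.0 - alpha) * sup


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(hybrid_contrastive_loss(feats, feats, [0, 0, 1, 1]))
```

In the supervised term, every sample with the same label acts as a positive, so a rare class still receives a training signal from each of its few examples in a batch. That is one plausible reading of how supervised contrastive learning "mines" information from labels with few samples.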

Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Multi-Modal Attentive Prompt Learning for Few-shot Emotion Recognition in Conversations

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

Contextual Information and Commonsense Based Prompt for Emotion Recognition in Conversation

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

Emotion recognition in conversations with emotion shift detection based on multi-task learning

CTNet: Conversational Transformer Network for Emotion Recognition

Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation

SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention for Emotion Recognition in Conversation

A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Multimodal transformer augmented fusion for speech emotion recognition

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation