Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

Shihao Zou, Xianying Huang, Xudong Shen
2023-10-04
Abstract: Emotion Recognition in Conversation (ERC) plays an important role in driving the development of human-machine interaction. Emotions can be expressed in multiple modalities, and multimodal ERC mainly faces two problems: (1) noise introduced during cross-modal information fusion, and (2) the difficulty of predicting emotion labels that have few samples and are semantically similar to one another but belong to different categories. To address these issues and fully utilize the features of each modality, we adopt the following strategies. First, we extract deep emotion cues from modalities with strong representation ability and design feature filters that produce multimodal prompt information from modalities with weak representation ability. Then, we design a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, so that the prompt information participates in encoding textual features and is fused with multi-level textual information to obtain better multimodal fusion features. Finally, we use a Hybrid Contrastive Learning (HCL) strategy to improve the model's ability to handle labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of the multimodal fusion and supervised contrastive learning to mine the information in labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art models in ERC on two benchmark datasets.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper attempts to address two main challenges in Emotion Recognition in Conversation (ERC):

1. **Noise in the process of cross-modal information fusion**: When information from different modalities (such as text, audio, and visual) is integrated, ignoring the semantic gap between the modalities can introduce a significant amount of noise, which degrades the final emotion prediction performance.
2. **Prediction of emotion labels with few samples**: ERC datasets contain many emotion labels that are semantically similar to one another but have few samples (e.g., fear and disgust, happiness and excitement). The small number of samples for these labels makes it difficult for the model to predict these emotion categories accurately, which lowers overall prediction performance.

To address these issues, the authors propose a new model called Multimodal Prompt Transformer with Hybrid Contrastive Learning (MPT-HCL), which aims to fully utilize the features of each modality, reduce the noise introduced during fusion, and improve prediction for emotion labels with few samples. The specific strategies include:

- **Deep emotion cue extraction**: Extract deep emotion cues from modalities with strong representation capabilities and design feature filters that produce multimodal prompt information from modalities with weak representation capabilities.
- **Multimodal Prompt Transformer (MPT)**: Embed multimodal fusion information into each attention layer of the Transformer, so that prompt information participates in text feature encoding and is fused with multi-level text information to obtain better multimodal fusion features.
- **Hybrid Contrastive Learning (HCL)**: Use unsupervised contrastive learning to enhance the representation capability of the multimodal fusion, and use supervised contrastive learning to mine information from labels with few samples, improving the model's ability to handle those labels.
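The idea of letting prompt information take part in text encoding inside an attention layer can be illustrated with a minimal sketch. It assumes a prefix-style design in which prompt vectors are prepended to the keys and values that the text tokens attend over. The summary does not confirm that MPT works exactly this way, and the names `prompt_attention`, `text`, and `prompts` are hypothetical. Learned query/key/value projections and multiple heads are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_attention(text, prompts):
    """Single-head attention in which text tokens attend over prompts + text.

    text:    list of d-dimensional text token vectors (used as queries)
    prompts: list of d-dimensional prompt vectors (hypothetical stand-ins
             for filtered audio/visual features)
    Returns (outputs, weights). Assumption: no learned projections; a real
    layer would apply query/key/value weight matrices first.
    """
    d = len(text[0])
    memory = prompts + text  # keys and values: prompts first, then text
    outputs, weights = [], []
    for q in text:
        # Scaled dot-product score between this query and every memory slot.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        w = softmax(scores)
        # Weighted average of the memory vectors, one output per text token.
        out = [sum(w[j] * memory[j][i] for j in range(len(memory))) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: three text tokens and one prompt vector, all 2-dimensional.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [[0.5, 0.5]]
outputs, weights = prompt_attention(text, prompts)
```

In this sketch every text token receives a nonzero attention weight on the prompt slot, so the prompt contributes to each encoded text representation. Repeating the same step in every layer would correspond to the multi-level fusion described above.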
Experimental results show that the proposed model outperforms existing state-of-the-art models on two benchmark datasets (IEMOCAP and MELD).
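The hybrid contrastive objective can be sketched as a weighted sum of a supervised and an unsupervised contrastive loss. The sketch below assumes a SupCon-style supervised term, where samples sharing a label are positives, and an InfoNCE-style unsupervised term, where each sample's positive is its own second view. The exact formulation, the temperature `tau`, and the mixing weight `alpha` are assumptions and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def supervised_contrastive(features, labels, tau=0.1):
    """SupCon-style loss: pull together samples that share a label."""
    z = [normalize(f) for f in features]
    n = len(z)
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner is skipped
        logits = {a: dot(z[i], z[a]) / tau for a in range(n) if a != i}
        denom = log_sum_exp(list(logits.values()))
        losses.append(-sum(logits[p] - denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def unsupervised_contrastive(view_a, view_b, tau=0.1):
    """InfoNCE-style loss: view_a[i] should match view_b[i] among all view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    losses = []
    for i in range(len(a)):
        logits = [dot(a[i], b[j]) / tau for j in range(len(b))]
        losses.append(log_sum_exp(logits) - logits[i])
    return sum(losses) / len(losses)


def hybrid_contrastive(features, second_view, labels, alpha=0.5, tau=0.1):
    """Weighted sum of both terms; alpha is a hypothetical mixing weight."""
    return (alpha * supervised_contrastive(features, labels, tau)
            + (1 - alpha) * unsupervised_contrastive(features, second_view, tau))


# Toy fused features: two tight clusters in 2-D.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
second_view = [[0.95, 0.05], [0.85, 0.15], [0.05, 0.95], [0.15, 0.85]]
matching_labels = [0, 0, 1, 1]    # labels agree with the clusters
mismatched_labels = [0, 1, 0, 1]  # labels cut across the clusters

loss_matching = supervised_contrastive(features, matching_labels)
loss_mismatched = supervised_contrastive(features, mismatched_labels)
loss_hybrid = hybrid_contrastive(features, second_view, matching_labels)
```

In this toy setup the supervised term is lower when labels agree with the feature clusters than when they cut across them. That is the signal a rare class would rely on to separate itself from a semantically similar neighbor.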