A Unimodal Valence-Arousal Driven Contrastive Learning Framework for Multimodal Multi-Label Emotion Recognition

Wenjie Zheng, Jianfei Yu, Rui Xia
DOI: https://doi.org/10.1145/3664647.3681638
2024-01-01
Abstract: Multimodal Multi-Label Emotion Recognition (MMER) aims to identify one or more emotion categories expressed by an utterance of a speaker. Despite obtaining promising results, previous studies on MMER represent each emotion category using a one-hot vector and ignore the intrinsic relations between emotions. Moreover, existing works mainly learn the unimodal representation based on the multimodal supervision signal of a single sample, failing to explicitly capture the unique emotional state of each modality as well as its emotional correlation between samples. To overcome these issues, we propose a Unimodal Valence-Arousal driven contrastive learning framework (UniVA) for the MMER task. Specifically, we adopt the valence-arousal (VA) space to represent each emotion category and regard the emotion correlation in the VA space as priors to learn the emotion category representation. Moreover, we employ pre-trained unimodal VA models to obtain the VA scores for each modality of the training samples, and then leverage the VA scores to construct positive and negative samples, followed by applying supervised contrastive learning to learn the VA-aware unimodal representations for multi-label emotion prediction. Experimental results on two benchmark datasets, MOSEI and M3ED, show that the proposed UniVA framework consistently outperforms a number of existing methods for the MMER task. The source code is publicly released at https://github.com/NUSTM/UniVA.
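
To make the VA-driven contrastive step concrete, the following is a minimal sketch and not the authors' implementation (see the released repository for that). It assumes that two samples of one modality count as a positive pair when their (valence, arousal) points lie within a fixed Euclidean distance, and it uses a standard supervised contrastive loss over L2-normalized features. The function names, the `threshold` and `temperature` values, and the distance rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def va_positive_mask(va_scores, threshold=0.5):
    """Mark sample pairs as positive when their VA points are close.

    Assumption: Euclidean distance with a fixed threshold. The paper's
    actual rule for building positives and negatives may differ.
    """
    n = len(va_scores)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(va_scores[i], va_scores[j]) < threshold:
                mask[i][j] = True
    return mask


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(features, mask, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have
    at least one positive. Anchors without positives are skipped."""
    z = [_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if mask[i][j]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical VA scores for three samples of one modality.
va = [(0.8, 0.6), (0.7, 0.5), (-0.6, -0.4)]
mask = va_positive_mask(va)

# Features that agree with the VA neighbourhoods give a lower loss
# than features that place VA-close samples far apart.
aligned = supcon_loss([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]], mask)
misaligned = supcon_loss([[1.0, 0.0], [-1.0, 0.0], [0.9, 0.1]], mask)
print(aligned < misaligned)
```

In this sketch the loss pulls together representations of samples whose VA scores are close and pushes apart the rest, which is the general behaviour the abstract attributes to the VA-aware unimodal representations.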