SMIN: Semi-supervised Multi-modal Interaction Network for Conversational Emotion Recognition

Zheng Lian, Bin Liu, Jianhua Tao
DOI: https://doi.org/10.1109/taffc.2022.3141237
IF: 13.99
2022-01-01
IEEE Transactions on Affective Computing
Abstract: Conversational emotion recognition is a crucial research topic in human-computer interaction. Due to the heavy annotation cost and inevitable label ambiguity, collecting large amounts of labeled data is challenging and expensive, which restricts the performance of current fully-supervised methods in this domain. To address this problem, researchers attempt to distill knowledge from unlabeled data via semi-supervised learning. However, most of these semi-supervised methods ignore multimodal interactive information, although recent works have shown that such interactive information is essential for emotion recognition. To this end, we propose a novel framework that integrates semi-supervised learning with multimodal interactions, called “Semi-supervised Multi-modal Interaction Network (SMIN)”. SMIN contains two semi-supervised modules, the “Intra-modal Interactive Module (IIM)” and the “Cross-modal Interactive Module (CIM)”, which learn intra- and cross-modal interactions, respectively. These two modules leverage additional unlabeled data to extract emotion-salient representations. To capture additional contextual information, we use hierarchical recurrent networks followed by a hybrid fusion strategy to integrate multimodal features. These multimodal features are then used for conversational emotion recognition. Experimental results on four benchmark datasets (i.e., IEMOCAP, MELD, CMU-MOSI and CMU-MOSEI) demonstrate that SMIN outperforms existing state-of-the-art methods on emotion recognition.
computer science, cybernetics, artificial intelligence