Multi-Modal Multi-Cultural Dimensional Continues Emotion Recognition In Dyadic Interactions

Jinming Zhao, Ruichen Li, Shizhe Chen, Qin Jin
DOI: https://doi.org/10.1145/3266302.3266313
2018-01-01
Abstract: Automatic emotion recognition is a challenging task that can greatly improve natural human-computer interaction. In this paper, we present our solutions for the Cross-cultural Emotion Sub-challenge (CES) of the Audio/Visual Emotion Challenge (AVEC) 2018. The videos were recorded in dyadic human-human interaction scenarios. In these complicated scenarios, a person's emotional state is influenced by the interlocutor's behaviors, such as talking style/prosody, speech content, facial expression and body language. We highlight two aspects of our solutions: 1) we explore efficient deep learning features for multiple modalities and use an LSTM network to capture long-term temporal information; 2) we propose several multimodal interaction strategies that imitate real interaction patterns in order to explore which modalities of the interlocutor's information are effective, and we identify the interaction strategy that makes the fullest use of the interlocutor's information. Our solutions achieve the best CCC performance of 0.704 and 0.783 on arousal and valence respectively on the German test set of the challenge, which significantly outperforms the baseline system (CCC of 0.524 and 0.577 on arousal and valence) and also outperforms the winner of AVEC 2017 (CCC of 0.675 and 0.756 on arousal and valence). The experimental results show that our proposed interaction strategies have strong generalization ability and yield more robust performance.
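The scores above are reported as CCC. The abstract does not spell out the formula, so the following is only a minimal sketch that assumes the usual definition of the concordance correlation coefficient, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), computed with population statistics. The function name `concordance_cc` and the sample values are illustrative and are not taken from the paper.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Assumes the standard definition with population (1/N) variance and
    covariance; this is an illustrative sketch, not the authors' code.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    mean_p = fmean(pred)
    mean_g = fmean(gold)

    # Population variance of each sequence and their covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are the same constant; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")

    return 2.0 * cov / denom


if __name__ == "__main__":
    gold = [0.1, 0.3, 0.2, 0.5, 0.4]       # hypothetical annotated arousal values
    pred = [0.15, 0.25, 0.2, 0.45, 0.35]   # hypothetical model predictions
    print(f"CCC = {concordance_cc(pred, gold):.3f}")
```

Unlike plain Pearson correlation, this measure penalizes a constant offset or a scale mismatch between prediction and annotation: a prediction that tracks the annotation perfectly but is shifted by a constant scores below 1.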