Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

Jian Huang, Ya Li, Jianhua Tao, Zheng Lian
DOI: https://doi.org/10.21437/interspeech.2018-1432
2018-01-01
Abstract: Automatic emotion recognition is a crucial element in understanding human behavior and interaction. Prior work on speech emotion recognition focuses on exploring various feature sets and models. In contrast to these methods, we propose a triplet framework based on a Long Short-Term Memory (LSTM) neural network for speech emotion recognition. The system learns a mapping from acoustic features to discriminative embedding features, which then serve as the basis for testing with an SVM. The proposed model is trained with a triplet loss and a supervised loss simultaneously. The triplet loss shortens intra-class distances and lengthens inter-class distances, while the supervised loss incorporates class label information. To handle variable-length inputs, we explore three different strategies that also make better use of temporal dynamic information. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed methods improve performance. We demonstrate the promise of the triplet framework for speech emotion recognition and present our analysis.
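To make the joint objective concrete, the following is a minimal NumPy sketch of a triplet loss combined with a supervised (cross-entropy) loss. It is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean distance, the hinge form, the `margin` and `weight` values, and all function names are hypothetical choices that the abstract does not specify. The embeddings are assumed to be fixed-size vectors already produced by the LSTM.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on squared Euclidean distances, averaged over the batch.

    anchor and positive share a class; negative comes from a different class.
    The margin value is an assumption, not taken from the paper.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # intra-class distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # inter-class distance
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, standing in for the supervised loss."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def joint_loss(anchor, positive, negative, logits, labels, margin=1.0, weight=1.0):
    """Sum of the two losses; the weighting scheme is an assumption."""
    return (triplet_loss(anchor, positive, negative, margin)
            + weight * cross_entropy(logits, labels))
```

The triplet term is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that still violate the margin contribute to the loss. The cross-entropy term is what brings in the class labels.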