Sequence-to-sequence Modelling for Categorical Speech Emotion Recognition Using Recurrent Neural Network

Xiaomin Chen, Wenjing Han, Huabin Ruan, Jiamu Liu, Haifeng Li, Dongmei Jiang
DOI: https://doi.org/10.1109/aciiasia.2018.8470325
2018-01-01
Abstract: To model categorical speech emotion recognition as a sequential task, the first challenge is how to convert the single categorical label of each utterance into a label sequence. To address this, we hypothesize that an utterance consists of alternating emotional and non-emotional segments, where the non-emotional segments correspond to silent regions, short pauses, transitions between phonemes, fricative phonemes, etc. Under this hypothesis, we propose to treat an utterance's label sequence as a chain of two kinds of states: emotional states denoting emotional frames and Nulls denoting non-emotional frames. We then exploit a connectionist temporal classification based recurrent neural network (CTC-RNN) to automatically label and align an utterance's emotional segments with emotional labels and its non-emotional segments with non-emotional labels. Experimental results on the IEMOCAP corpus demonstrate the effectiveness of our proposed method compared to state-of-the-art emotion recognition algorithms.
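As an illustration of the label-sequence idea, the following is a minimal sketch of how a frame-level path of emotional states and Nulls could be collapsed in the standard CTC manner (merge consecutive repeats, then drop Nulls). It is not the authors' implementation: the `NULL` symbol, the label name `"ang"`, the function names, and the majority-vote rule for choosing an utterance-level emotion are all assumptions made for this example.

```python
from collections import Counter
from itertools import groupby

NULL = "-"  # hypothetical symbol for a non-emotional (Null) frame


def collapse(path):
    """Standard CTC collapse: merge consecutive repeats, then drop NULL."""
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != NULL]


def utterance_emotion(path):
    """Assumed decision rule: majority vote over the non-NULL frames.

    Returns None when every frame is NULL.
    """
    counts = Counter(label for label in path if label != NULL)
    return counts.most_common(1)[0][0] if counts else None


# A made-up frame-level path for an utterance labelled "ang" (angry):
# emotional segments alternate with Null segments (pauses, silences, ...).
frames = ["-", "-", "ang", "ang", "-", "ang", "-", "-", "ang", "ang", "ang", "-"]

print(collapse(frames))           # ['ang', 'ang', 'ang']
print(utterance_emotion(frames))  # ang
```

Many different frame-level paths collapse to the same label sequence, which is why a CTC-trained network can learn the alignment between emotional segments and frames without frame-level annotations.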