Emotion Recognition From Children Speech Signals Using Attention Based Time Series Deep Learning

Guitao Cao, Yunming Tang, Jiyu Sheng, Wenming Cao
DOI: https://doi.org/10.1109/BIBM47256.2019.8982992
2019-01-01
Abstract: Children express emotion mainly through acoustic properties of the voice, such as tone and timbre, rather than through semantics, and their speech contains many lengthy fragments. This paper proposes an emotion recognition model based on time series deep learning, an attention-based CNN and Bi-directional Long Short-Term Memory network (CNN-BiLSTM), to extract emotional features. After the speech signal is preprocessed, forty-dimensional Mel Frequency Cepstral Coefficient (MFCC) related parameters are extracted, including both static and dynamic features. These frequency-domain features are then enhanced by convolutional neural networks (CNNs) to serve as the emotional features for children's speech recognition. The BiLSTM addresses the poor performance in learning long-term dependencies, and the attention mechanism is used because only a few frames of a child's speech signal carry emotional features. Compared with related speech emotion recognition models such as LSTM-CNN and 2D-CNN-LSTM, the proposed model improves accuracy to as much as 71.6% on the FAU-AIBO children's speech emotion database.
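The abstract motivates attention by noting that only a few frames carry emotional information. The sketch below illustrates that general idea with plain dot-product attention pooling over per-frame vectors. It is not the paper's formulation, which the abstract does not specify: the function names, the `query` scoring vector (which would be learned in a real model), and the toy frame values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: Sequence[Sequence[float]],
                   query: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Collapse per-frame vectors into a single utterance-level vector.

    frames: one feature vector per time step (e.g. recurrent-layer outputs).
    query:  hypothetical scoring vector, same length as one frame.
    Returns (attention weights, attention-weighted sum of the frames).
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Normalize scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum over time, computed one feature dimension at a time.
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return weights, pooled


# Toy input (made-up values): four 3-dimensional frames, where only the
# frame at index 2 has large activations.
frames = [
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
    [2.0, 1.5, 1.0],
    [0.1, 0.1, 0.0],
]
query = [1.0, 1.0, 1.0]
weights, pooled = attention_pool(frames, query)
```

With these toy values, most of the attention weight falls on the frame at index 2, so the pooled vector is dominated by that frame while the low-activation frames contribute little.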