Speech Emotion Recognition Based on Three-Channel Feature Fusion of CNN and BiLSTM

Lilong Huang, Jing Dong, Dongsheng Zhou, Qiang Zhang
DOI: https://doi.org/10.1145/3390557.3394317
2020-01-01
Abstract: For natural human-computer interaction, it is important that the computer can quickly and effectively identify the emotional state of the person interacting with it. We propose a three-channel feature fusion algorithm for speech emotion recognition based on a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (BiLSTM) network. Mel-Frequency Cepstral Coefficients (MFCCs) and their first- and second-order difference features are fed into three channels, one feature type per channel, and each channel is trained separately to exploit the emotional information contained in the dynamic features. By splicing and fusing the features obtained from the three channels, the proposed algorithm reaches recognition rates of 96.562% on the CASIA corpus with six emotional states and 92.34% on the EMO-DB corpus with seven emotional states.
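The channel inputs and the splicing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the difference formula or the network details, so this sketch assumes a simple frame-to-frame difference with a zero first frame, and it stands in for the CNN and BiLSTM channels with plain feature vectors. The names `delta`, `three_channel_inputs`, and `splice` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # frames x coefficients


def delta(frames: Matrix) -> Matrix:
    """First-order difference along the time axis.

    Assumption: a simple frame-to-frame difference; the first frame's
    delta is set to zero so the output keeps the input shape.
    """
    if not frames:
        return []
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def three_channel_inputs(mfcc: Matrix) -> Tuple[Matrix, Matrix, Matrix]:
    """Return the static MFCCs plus first- and second-order differences,
    one matrix per channel."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return mfcc, d1, d2


def splice(channel_features: List[List[float]]) -> List[float]:
    """Fuse per-channel feature vectors by concatenation (splicing)."""
    fused: List[float] = []
    for features in channel_features:
        fused.extend(features)
    return fused


if __name__ == "__main__":
    # Toy MFCC matrix: 3 frames, 2 coefficients (made-up values).
    mfcc = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
    static, d1, d2 = three_channel_inputs(mfcc)
    print(d1)
    print(d2)
    # Stand-in channel outputs; in the paper these would come from
    # the trained CNN and BiLSTM channels.
    print(splice([static[-1], d1[-1], d2[-1]]))
```

In a full model, each of the three matrices would pass through its own trained channel, and the concatenated vector would feed a classifier over the emotion classes.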
What problem does this paper attempt to address?