Abstract: We aim to learn deep emotion features for speech emotion recognition. Two networks combining a convolutional neural network with long short-term memory (CNN LSTM), one 1D CNN LSTM network and one 2D CNN LSTM network, were constructed to learn local and global emotion-related features from speech signals and log-mel spectrograms, respectively. The two networks share a similar architecture: each consists of four local feature learning blocks (LFLBs) and one long short-term memory (LSTM) layer. An LFLB, which mainly contains one convolutional layer and one max-pooling layer, is built to learn local correlations and to extract hierarchical correlations. The LSTM layer then learns long-term dependencies from the learned local features. By combining a convolutional neural network (CNN) with an LSTM, the designed networks exploit the strengths of both and offset their individual shortcomings. They are evaluated on two benchmark databases. The experimental results show that the designed networks perform very well on speech emotion recognition; in particular, the 2D CNN LSTM network outperforms traditional approaches, the deep belief network (DBN), and CNN on the selected databases. The 2D CNN LSTM network achieves recognition accuracies of 95.33% and 95.89% in speaker-dependent and speaker-independent experiments on Berlin EmoDB, respectively, which compare favourably with the accuracies of 91.6% and 92.9% obtained by traditional approaches. On the IEMOCAP database it yields recognition accuracies of 89.16% and 52.14% in speaker-dependent and speaker-independent experiments, respectively, which are much higher than the accuracies of 73.78% and 40.02% obtained by DBN and CNN.
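To make the stacking described in the abstract concrete, the following minimal sketch traces tensor shapes through four LFLBs and one LSTM layer for the 2D case. It is only an illustration under stated assumptions: the abstract does not give filter counts, pooling sizes, padding, input dimensions, or the number of LSTM units, so every numeric value below, the `LFLB` class, and the `trace_2d_cnn_lstm` helper are hypothetical. The sketch also assumes that each convolution is 'same'-padded with stride 1 and that only the final LSTM hidden state is passed on.

```python
from dataclasses import dataclass


@dataclass
class LFLB:
    """Local feature learning block: one convolution followed by one max-pooling."""
    filters: int  # output channels of the convolution (assumed value)
    pool: int     # max-pooling window and stride (assumed value)

    def out_shape(self, shape):
        # shape = (freq_bins, time_frames, channels).
        # Assumption: the convolution is 'same'-padded with stride 1, so it
        # changes only the channel count; non-overlapping max-pooling then
        # floor-divides both spatial axes by `pool`.
        freq, time, _ = shape
        return (freq // self.pool, time // self.pool, self.filters)


def trace_2d_cnn_lstm(input_shape, blocks, lstm_units):
    """Return the shape after each LFLB, the LSTM input, and the LSTM output."""
    shapes = [input_shape]
    for block in blocks:
        shapes.append(block.out_shape(shapes[-1]))
    freq, time, channels = shapes[-1]
    # Treat the remaining time axis as the sequence and fold frequency and
    # channels into one feature vector per time step.
    sequence = (time, freq * channels)
    # Assumption: only the final hidden state of the LSTM is kept.
    return {"blocks": shapes, "sequence": sequence, "lstm_out": (lstm_units,)}


if __name__ == "__main__":
    # Hypothetical configuration: four LFLBs on a 128 x 256 log-mel spectrogram.
    blocks = [LFLB(64, 2), LFLB(64, 4), LFLB(128, 4), LFLB(128, 4)]
    result = trace_2d_cnn_lstm((128, 256, 1), blocks, lstm_units=256)
    for shape in result["blocks"]:
        print(shape)
    print(result["sequence"], result["lstm_out"])
```

The trace shows the division of labour the abstract describes: the LFLBs progressively shrink the spectrogram into a short sequence of local feature vectors, and the LSTM then summarizes that sequence into a single vector for emotion classification.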

Speech Emotion Recognition Based on Deep Belief Networks and Wavelet Packet Cepstral Coefficients

Feature Fusion Methods Research Based on Deep Belief Networks for Speech Emotion Recognition under Noise Condition

Speech Emotion Recognition Based on Coiflet Wavelet Packet Cepstral Coefficients

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Improved Emotion Recognition With Novel Task-Oriented Wavelet Packet Features

Deep Spectrum Feature Representations for Speech Emotion Recognition

A Study of Deep Belief Network Based Chinese Speech Emotion Recognition

Adaptive Wavelet Packet Filter-Bank Based Acoustic Feature for Speech Emotion Recognition

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

Research on Speech Emotion Recognition Technology based on Deep and Shallow Neural Network

Speech Emotion Recognition Based on Multi-task Deep Feature Extraction and MKPCA Feature Fusion

Random Deep Belief Networks for Recognizing Emotions from Speech Signals

Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

Speech Emotion Recognition Research Based on Wavelet Neural Network for Robot Pet

End-To-End Speech Emotion Recognition Based On Neural Network

A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms

Speech emotion recognition using deep 1D & 2D CNN LSTM networks

Speech emotion recognition based on wavelet packet coefficient model

A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition