Abstract: Automatically detecting emotional states in human speech, which plays an important role in human-machine interaction, remains a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on extracting carefully hand-crafted and tailored features. Recently, spectrogram representations of emotional speech have achieved competitive performance in automatic speech emotion recognition. In this work we propose a method for extracting deep features, herein denoted deep spectrum features, from the spectrogram by combining attention-based bidirectional long short-term memory (BLSTM) recurrent neural networks with fully convolutional networks (FCNs). The learned deep spectrum features are then fed into a deep neural network (DNN) to predict the final emotion. The proposed model is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset to validate its effectiveness. The deep spectrum representations extracted by the proposed model outperform the existing methods they are compared against, reaching 65.2% weighted accuracy and 68.0% unweighted accuracy. We then compare the performance of our deep spectrum features with two standard acoustic feature representations for speech-based emotion recognition. When combined with a support vector classifier, the extracted deep feature representations perform comparably to the conventional features. Moreover, we investigate the impact of different frequency resolutions of the input spectrogram on the performance of the system.
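
The abstract reports both weighted and unweighted accuracy without defining them. The sketch below follows the convention common in speech emotion recognition work, which is assumed here rather than confirmed by the abstract: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from IEMOCAP.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels (made up for illustration, not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two metrics can diverge: in this toy case the majority class is always right, so the weighted figure is higher than the unweighted one, which penalizes the missed minority classes.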

Speech Emotion Classification with the Combination of Statistic Features and Temporal Features

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Speech Emotion Recognition Using Acoustic Features

Speech Emotion Recognition Based on Feature Selection and Extreme Learning Machine Decision Tree

Scores Selection for Emotional Speaker Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Speech Emotion Recognition Based on Linear Discriminant Analysis and Support Vector Machine Decision Tree

Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space

Speech emotion recognition using hidden Markov models

Speech Emotion Recognition Based on a Fusion of All-Class and Pairwise-Class Feature Selection

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Fusion Of Global Statistical And Segmental Spectral Features For Speech Emotion Recognition

Speech Emotion Classification Using Attention-Based LSTM

Combining Feature Selection And Representation For Speech Emotion Recognition

Multi-stage Classification of Emotional Speech Motivated by a Dimensional Emotion Model

Speech emotion recognition: Features and classification models

A Study of Speech Emotion Recognition Based on Hybrid Algorithm

Speech Emotion Recognition And Intensity Estimation