Abstract: Automatically detecting emotional states in human speech, which plays an important role in human-machine interaction, has been a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on extracting carefully hand-crafted and tailored features. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition. In this work, we propose a method for extracting deep features, herein denoted deep spectrum features, from the spectrogram by combining attention-based bidirectional long short-term memory (BLSTM) recurrent neural networks with fully convolutional networks. The learned deep spectrum features are then fed into a deep neural network (DNN) to predict the final emotion. The proposed model is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset to validate its effectiveness. The results indicate that the deep spectrum representations extracted by the proposed model perform best among the existing methods compared, reaching 65.2% weighted accuracy and 68.0% unweighted accuracy. We then compare the performance of our deep spectrum features with two standard acoustic feature representations for speech-based emotion recognition. When combined with a support vector classifier, the extracted deep feature representations perform comparably to the conventional features. Finally, we investigate how different frequency resolutions of the input spectrogram affect the performance of the system.
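The abstract reports both weighted and unweighted accuracy without defining them. The sketch below assumes the convention common in IEMOCAP-based work: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances (assumed definition)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally (assumed definition)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = defaultdict(int)  # number of utterances per true class
    hits = defaultdict(int)   # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels): 6 neutral and 2 angry utterances.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 6/6 and 1/2
```

On imbalanced data the two metrics diverge: here the majority class dominates the weighted figure, while the unweighted figure penalizes the missed minority-class utterance more heavily.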

Syllable Level Speech Emotion Recognition Based on Formant Attention

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Self-attention Transfer Networks for Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Deep Spectrum Feature Representations for Speech Emotion Recognition

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition Based on Clustering Assistance

Speech Emotion Recognition Based on Acoustic Segment Model

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Multimodal Emotion Recognition from Raw Audio with Sinc-convolution

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition