Abstract: The great variety of human emotional expression, together with differences in how people perceive and annotate emotions, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, considerable progress has been made in supervised SER systems. However, existing convolutional neural networks have certain limitations, such as their inability to capture global features well, although these features contain important emotional information. In addition, because emotion is subjective and continuous, the instance-level segments into which emotional speech is typically divided do not fully reflect the true labels and cannot describe dynamic temporal changes. As a result, an accurate emotional representation cannot be learnt during feature extraction. To overcome these limitations, we propose a speech-only end-to-end network that maps sequences of different lengths to a fixed number of chunks and strictly preserves the order of the chunks by adaptively adjusting their overlap. It then extracts log-mel spectrogram features from the chunk-level segments and feeds them into the Residual Multi-Scale Convolutional Neural Networks with Transformer (RMSCTx) model framework. Finally, with the order of the chunk-level segments preserved, a temporal-domain mean layer is used to extract utterance-level feature representations. With this method, we perform multidimensional SER, i.e., the prediction of arousal, valence, and dominance. Experimental results on three popular corpora demonstrate both the superiority of our approach and the robustness of the model for SER, with improved recognition accuracy on the newest version of the public MSP-Podcast dataset (1.9).
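
The chunking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the even spacing of chunk starts, and the plain averaging are assumptions made for illustration. It shows how a variable-length frame sequence could be mapped to a fixed number of equal-length, order-preserving chunks whose overlap adapts to the sequence length, and how chunk-level outputs could be averaged over time into one utterance-level vector.

```python
def chunk_starts(num_frames, chunk_len, num_chunks):
    """Start indices for num_chunks chunks of chunk_len frames each.

    Assumption: starts are spread evenly from the first frame to the last
    possible start, so the step (and hence the overlap) depends on length.
    """
    if num_frames < chunk_len:
        raise ValueError("sequence is shorter than one chunk; pad it first")
    if num_chunks == 1:
        return [0]
    span = num_frames - chunk_len
    # Integer arithmetic keeps the starts non-decreasing, so order is preserved.
    return [(i * span) // (num_chunks - 1) for i in range(num_chunks)]


def split_into_chunks(frames, chunk_len, num_chunks):
    """Split a list of frames into a fixed number of equal-length chunks."""
    starts = chunk_starts(len(frames), chunk_len, num_chunks)
    return [frames[s:s + chunk_len] for s in starts]


def temporal_mean(chunk_outputs):
    """Average equal-length chunk-level vectors into one utterance-level vector."""
    n = len(chunk_outputs)
    return [sum(values) / n for values in zip(*chunk_outputs)]


if __name__ == "__main__":
    # 250 hypothetical frames, each a one-dimensional feature vector.
    frames = [[float(t)] for t in range(250)]
    chunks = split_into_chunks(frames, chunk_len=100, num_chunks=5)
    print(len(chunks), len(chunks[0]))
    print(temporal_mean([[1.0, 2.0], [3.0, 4.0]]))
```

In this sketch, a shorter utterance yields more overlap between neighbouring chunks, while a sequence longer than `num_chunks * chunk_len` frames would leave gaps between chunks. How the paper handles that case is not stated in the abstract.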

SeeSpeech - See Emotions in The Speech.

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence.

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Fuzzy speech emotion recognition considering semantic awareness

Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Speech emotion recognition with deep convolutional neural networks

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Speech Emotion Recognition with Complementary Acoustic Representations.

Speech Emotion Recognition Using Deep Learning

A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition

Improved Speech Emotion Classification Using Deep Neural Network

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Speech emotion analysis using convolutional neural network (CNN) and gamma classifier-based error correcting output codes (ECOC)

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition