Abstract: International Journal of Pattern Recognition and Artificial Intelligence, Ahead of Print. Speech emotion recognition (SER) methods rely on frames to analyze speech data. However, existing methods typically divide a speech sample into smaller speech frames and label all of them with a single emotion tag, which ignores the possibility that multiple emotion tags coexist within one speech sample. To address this limitation, we present a novel approach for SER called self-labeling learning ensemble via DRNN and self-representation (En-DRNN-SR). The method first automatically segments a speech sample into speech frames. A deep recurrent neural network (DRNN) is then applied to learn deep features, and a self-representation is built to obtain a relational degree matrix. Finally, the relational degree matrix is used to divide the speech frames into three parts: the key emotional frames, the compatible emotional frames, and the noise frames. The emotion tags of the compatible emotional frames are learned adaptively and cyclically from the key emotional frames via the relational degree matrix, while the emotion tags associated with the key compatible frames are also checked. Additionally, we introduce a new self-labeling criterion for SER based on fuzzy membership degree. To evaluate the feasibility and effectiveness of the proposed En-DRNN-SR, we conducted extensive experiments on the IEMOCAP, EMODB, and SAVEE databases. En-DRNN-SR obtains results of 69.13%, 82.83%, and 52.31% on IEMOCAP, EMODB, and SAVEE, respectively, outperforming all competing algorithms. These experimental results indicate that the proposed approach outperforms state-of-the-art SER methods in both feature learning and classification.
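The frame-partitioning idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the ridge-regularised self-representation, the quantile thresholds, and the weighted-vote membership below are all assumptions, as are the function names `self_representation`, `partition_frames`, and `propagate_labels`. Random vectors stand in for the DRNN frame features.

```python
import numpy as np


def self_representation(features, lam=0.1):
    """Approximate each frame as a linear combination of all frames.

    Assumed formulation: minimise ||X - C X||_F^2 + lam * ||C||_F^2, whose
    closed form is C = (G + lam I)^-1 G with G = X X^T.
    features: array of shape (n_frames, dim).
    """
    n = features.shape[0]
    gram = features @ features.T
    coeffs = np.linalg.solve(gram + lam * np.eye(n), gram)
    np.fill_diagonal(coeffs, 0.0)  # heuristic: drop trivial self-reconstruction
    # Symmetrise so relation[i, j] is a shared "relational degree".
    return 0.5 * (np.abs(coeffs) + np.abs(coeffs).T)


def partition_frames(relation, key_quantile=0.7, noise_quantile=0.2):
    """Split frames into key / compatible / noise by total relational degree.

    The quantile thresholds are hypothetical; the abstract does not state
    how the three groups are separated.
    """
    degree = relation.sum(axis=1)
    hi = np.quantile(degree, key_quantile)
    lo = np.quantile(degree, noise_quantile)
    groups = np.full(relation.shape[0], "compatible")
    groups[degree <= lo] = "noise"
    groups[degree >= hi] = "key"  # key wins if the thresholds coincide
    return groups


def propagate_labels(relation, groups, key_labels, n_classes):
    """Soft (fuzzy-style) class membership for compatible frames.

    Each compatible frame receives a relation-weighted vote from the
    labelled key frames, normalised to sum to one.
    """
    key_idx = np.where(groups == "key")[0]
    comp_idx = np.where(groups == "compatible")[0]
    one_hot = np.eye(n_classes)[key_labels]          # (n_key, n_classes)
    weights = relation[np.ix_(comp_idx, key_idx)]    # (n_comp, n_key)
    membership = weights @ one_hot
    membership /= membership.sum(axis=1, keepdims=True) + 1e-12
    return comp_idx, membership


# Toy usage: 40 frames with 16-dimensional stand-in features, 4 emotion classes.
rng = np.random.default_rng(0)
frames = rng.normal(size=(40, 16))
relation = self_representation(frames)
groups = partition_frames(relation)
n_key = int((groups == "key").sum())
key_labels = rng.integers(0, 4, size=n_key)  # placeholder key-frame tags
comp_idx, membership = propagate_labels(relation, groups, key_labels, 4)
```

In this sketch, frames labelled "noise" are simply left out of propagation, and each row of `membership` could be thresholded to accept or reject a self-assigned tag. The cyclic re-estimation and the checking of key-frame tags described in the abstract are not modelled here.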

Noise-label Suppressed Module for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition Based on Acoustic Segment Model

Speech Emotion Recognition Based on Clustering Assistance

Syllable Level Speech Emotion Recognition Based on Formant Attention

Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech

Speech Emotion Recognition Based on Meta-Transfer Learning with Domain Adaption

Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

Multimodal Emotion Recognition from Raw Audio with Sinc-convolution

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Learning Robust Self-attention Features for Speech Emotion Recognition with Label-adaptive Mixup

Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement