Abstract: Speech Emotion Recognition (SER), which aims to identify the emotion expressed in speech, has long been an important topic in speech and acoustic processing. In recent years, deep-learning methods have brought great progress to SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representations still limit further research. This paper proposes a novel SER method that combines data augmentation, feature selection, and feature fusion. First, to address the shortage of samples in speech emotion datasets and the imbalance in the number of samples per category, a speech data augmentation method, Mix-wav, is proposed; it is applied to audio clips of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to extract a spectrum vector from the Log-Mel spectrum. On the other hand, Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction across four global emotion feature sets, and the more effective emotion statistical features it selects are fused with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and on the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB), where the proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. The results show that a global feature set, after feature selection, can supplement the features extracted by a single deep-learning model through feature fusion and thereby achieve better classification accuracy.
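
The abstract reports unweighted average accuracy, a metric chosen because the emotion classes are unbalanced. The sketch below assumes the usual SER convention that unweighted accuracy is the mean of per-class recalls, so each emotion counts equally regardless of how many samples it has. The function names and the toy labels are illustrative assumptions, not code or data from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of all samples classified correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced example (made-up labels, not data from the paper).
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

ua = unweighted_accuracy(y_true, y_pred)  # (8/8 + 1/2) / 2
wa = weighted_accuracy(y_true, y_pred)    # 9 correct out of 10
print(f"UA = {ua:.2f}, WA = {wa:.2f}")
```

In this toy case the majority class hides the weak minority-class recall in the plain accuracy (0.90), while the unweighted figure (0.75) exposes it.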

Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation

A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Emotion Controllable Speech Synthesis Using Emotion-Unlabeled Dataset with the Assistance of Cross-Domain Speech Emotion Recognition

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech

Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

Emotional Prosody Control for Speech Generation

Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations

Exemplar-Based Emotive Speech Synthesis

Improving speaker verification robustness with synthetic emotional utterances

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

A Model of Emotional Speech Generation Based on Conditional Generative Adversarial Networks