Abstract: Speech Emotion Recognition (SER), which aims to identify the emotion expressed in speech, has long been an important topic among speech acoustic tasks. In recent years, deep-learning methods have driven great progress in SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representations still limit research. This paper proposes a novel SER method that combines data augmentation, feature selection, and feature fusion. First, to address the shortage of samples in speech emotion datasets and the imbalance in the number of samples per category, a speech data augmentation method, Mix-wav, is proposed; it is applied to audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to further extract a spectrum vector from the Log-Mel spectrum. On the other hand, Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction across four global emotion feature sets, and the more effective emotion statistical features it yields are fused with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and on the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. Our research shows that the global feature set obtained after feature selection can supplement, through feature fusion, the features extracted by a single deep-learning model and thereby achieve better classification accuracy.
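
The abstract reports unweighted average accuracy, which matters when the categories are unbalanced. As a minimal sketch, assuming the common SER convention that unweighted accuracy is the mean of per-class recalls (the abstract does not spell out its exact definition), the metric can be computed and compared with plain overall accuracy as follows. The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally.

    Assumed definition (common in SER work); not confirmed by the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def weighted_accuracy(y_true, y_pred):
    """Plain overall accuracy: the fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels for an unbalanced two-class case (not real results).
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
print(f"unweighted accuracy: {ua:.2f}, overall accuracy: {wa:.2f}")
```

In this toy case the minority class is recognized only half the time, which pulls the unweighted figure well below the overall accuracy. That gap is why the unweighted metric is the more informative one on unbalanced emotion datasets.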

Fusion Model for Speech Emotion Recognition with Low Level Descriptor Features

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Ann Based Decision Fusion for Speech Emotion Recognition

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Classifier fusion for speech emotion recognition

MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention

A Low-rank Matching Attention based Cross-modal Feature Fusion Method for Conversational Emotion Recognition

Multimodal Emotion Recognition Using a Hierarchical Fusion Convolutional Neural Network

Speech emotion recognition using feature fusion: a hybrid approach to deep learning

A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

Fusion Of Global Statistical And Segmental Spectral Features For Speech Emotion Recognition

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion