Abstract: Speech emotion recognition plays an increasingly important role in emotional computing and remains a challenging task because of its complexity. In this study, we developed a framework integrating three distinct classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (angry, happy, neutral, and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) computed on the LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was used to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models so that they acquire generalized features by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. Three emotion recognition experiments were conducted on the IEMOCAP corpus. The experimental results show that attention-based weighted pooling enabled the neural networks to focus on emotionally salient parts, and that the generalized features learned through multi-task learning helped the networks achieve higher accuracy in emotion classification. Furthermore, the proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, both significantly higher than those of each individual classifier. The effectiveness of the proposed classifier-fusion approach was thus validated.
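
The attention-based weighted pooling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the dot-product scoring, and the fixed `query` vector are assumptions made for illustration (in a trained model the scoring parameters would be learned). The sketch only shows the general idea of replacing a plain average over frames with a softmax-weighted sum.

```python
import math
from typing import List, Sequence


def attention_weighted_pooling(
    frames: Sequence[Sequence[float]], query: Sequence[float]
) -> List[float]:
    """Pool T frame-level vectors of size D into one utterance-level vector.

    frames: per-frame outputs (e.g. of an RNN), one D-dimensional vector each.
    query:  hypothetical D-dimensional scoring vector (learned in a real
            model; fixed here for illustration only).
    """
    # One scalar relevance score per frame: dot(query, frame).
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frames; the weights are non-negative and sum to 1.
    dim = len(frames[0])
    return [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]


# Toy input: the third frame scores much higher than the other two.
frames = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
pooled = attention_weighted_pooling(frames, query=[1.0, 1.0])
mean_pooled = [sum(col) / len(frames) for col in zip(*frames)]
```

With this toy input the pooled vector is pulled toward the high-scoring third frame, whereas mean pooling treats all frames equally. An all-zero `query` gives uniform weights, so the function reduces to mean pooling.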

Multi-type Features Separating Fusion Learning for Speech Emotion Recognition

Speech Emotion Recognition Using Multi-Modal Feature Fusion Network

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Learning multi-scale features for speech emotion recognition with connection attention mechanism

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

Combining Multi-scale and Self-Supervised Features for Speech Emotion Recognition

An autoencoder-based feature level fusion for speech emotion recognition

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

Speech Emotion Recognition Based on Multi-task Deep Feature Extraction and MKPCA Feature Fusion

Multimodal Speech Emotion Recognition Based on Multi-Scale MFCCs and Multi-View Attention Mechanism

Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion

Graph-based multi-Feature fusion method for speech emotion recognition

Multi-head attention fusion networks for multi-modal speech emotion recognition