Abstract: Speech emotion recognition plays an increasingly important role in emotional computing and remains a challenging task because of its complexity. In this study, we developed a framework integrating three distinct classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (angry, happy, neutral, and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) computed on the LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was used to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models to acquire generalized features by simultaneously performing classification of the discrete categories and regression of the continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. Three emotion recognition experiments were conducted on the IEMOCAP corpus. The experimental results show that attention-based weighted pooling enabled the neural networks to focus on emotionally salient parts, and that the generalized features learned through multi-task learning helped the networks achieve higher accuracy in emotion classification. Furthermore, the proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, both significantly higher than those of each individual classifier. The effectiveness of the proposed classifier-fusion approach was thus validated.
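The three mechanisms named in the abstract (attention-based weighted pooling, a multi-task objective, and confidence-based fusion) can be illustrated with a minimal sketch. It is not the authors' implementation, and every specific in it is an assumption made for illustration: the attention scores are plain dot products with a hypothetical `query` vector, the multi-task loss is cross-entropy plus a weighted mean squared error with a hypothetical `reg_weight`, and a classifier's confidence is taken to be its maximum posterior probability. The abstract does not specify any of these details, and the numbers are toy values.

```python
import math

EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Attention-based weighted pooling over per-frame feature vectors.

    `query` stands in for a learned attention vector (hypothetical here).
    Returns the pooled vector and the attention weights.
    """
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def multitask_loss(class_probs, true_index, attr_pred, attr_true, reg_weight=1.0):
    """Cross-entropy on the category plus weighted MSE on the attributes.

    Assumption: the two task losses are combined by a simple weighted sum.
    """
    cross_entropy = -math.log(class_probs[true_index])
    mse = sum((p - t) ** 2 for p, t in zip(attr_pred, attr_true)) / len(attr_true)
    return cross_entropy + reg_weight * mse


def confidence_fusion(posteriors):
    """Fuse classifier posteriors, weighting each by its own confidence.

    Assumption: confidence is the classifier's maximum posterior probability.
    """
    confidences = [max(p) for p in posteriors]
    total = sum(confidences)
    n_classes = len(posteriors[0])
    return [
        sum(c * p[k] for c, p in zip(confidences, posteriors)) / total
        for k in range(n_classes)
    ]


# Toy usage: three frames of 2-D features, then three classifier outputs.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
pooled, weights = attention_pool(frames, query=[1.0, 1.0])

posteriors = [
    [0.70, 0.10, 0.10, 0.10],  # stand-in for HSF-DNN output
    [0.30, 0.40, 0.20, 0.10],  # stand-in for MS-CNN output
    [0.25, 0.25, 0.25, 0.25],  # stand-in for LLD-RNN output
]
fused = confidence_fusion(posteriors)
label = EMOTIONS[fused.index(max(fused))]
```

In this toy run the second frame receives the largest attention weight, and the fused decision follows the most confident classifier. A maximally uncertain classifier (the uniform posterior) still contributes, but with the smallest weight.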

Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition Based on Feature Selection and Extreme Learning Machine Decision Tree

Speech Emotion Recognition Based on Multi-task Deep Feature Extraction and MKPCA Feature Fusion

An autoencoder-based feature level fusion for speech emotion recognition

Speech Emotion Recognition Based on Feature Fusion

Speech Emotion Recognition Based on Convolutional Neural Network and Feature Fusion

Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN

Speech emotion recognition using feature fusion: a hybrid approach to deep learning

A Feature Fusion Method Based on Extreme Learning Machine for Speech Emotion Recognition

Speech Emotion Classification Using Acoustic Features

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion

Research on Deep Learning-based Speech Emotion Recognition System

Speech Emotion Recognition With Acoustic And Lexical Features

Multi-type Features Separating Fusion Learning for Speech Emotion Recognition

Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion