Abstract: Speech emotion recognition plays an increasingly important role in emotional computing and remains a challenging task because of its complexity. In this study, we developed a framework integrating three distinct classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (angry, happy, neutral, and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) computed on the LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was used to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models to acquire generalized features by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. Three emotion recognition experiments were conducted on the IEMOCAP corpus. The experimental results show that attention-based weighted pooling enabled the neural networks to focus on emotionally salient parts, and that the generalized features learned through multi-task learning helped the networks achieve higher accuracy in emotion classification. Furthermore, the proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, both significantly higher than those of each individual classifier, which validates the effectiveness of the proposed classifier-fusion approach.
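
The attention-based weighted pooling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes a common variant: each frame-level output vector is scored by a dot product with a learnable vector, the scores are normalized with a softmax, and the utterance-level representation is the weighted sum of the frame outputs. The names `attention_pool`, `frames`, and `u` are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, u):
    """Aggregate frame-level vectors into one utterance-level vector.

    frames: list of T vectors (each a list of D floats), e.g. RNN or CNN outputs.
    u:      attention parameter vector of length D (assumed; learned in practice).
    Returns (pooled_vector, attention_weights).
    """
    # Score each frame by its dot product with the attention vector.
    scores = [sum(h_d * u_d for h_d, u_d in zip(h, u)) for h in frames]
    # Normalize the scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # Weighted sum over time for every feature dimension.
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy example: the third frame has the largest score, so it dominates the pooled vector.
frames = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
u = [1.0, 1.0]
pooled, weights = attention_pool(frames, u)
```

With an all-zero `u` the weights become uniform and the sketch reduces to plain mean pooling, which is the baseline that attention-weighted pooling is typically compared against.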

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Multi-type Features Separating Fusion Learning for Speech Emotion Recognition

Speech Emotion Recognition Using Multi-Modal Feature Fusion Network

Graph-based multi-Feature fusion method for speech emotion recognition

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion

An autoencoder-based feature level fusion for speech emotion recognition

Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition

Fusion approaches for emotion recognition from speech using acoustic and text-based features

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN

Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis

Dual Memory Fusion for Multimodal Speech Emotion Recognition

Fusion Model for Speech Emotion Recognition with Low Level Descriptor Features

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Combining Multi-scale and Self-Supervised Features for Speech Emotion Recognition

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

MS-SENet: Enhancing Speech Emotion Recognition Through Multi-scale Feature Fusion With Squeeze-and-excitation Blocks