Abstract: Automatic emotion recognition from speech plays a fundamental role in building emotional intelligence into human-machine interaction systems. The discriminative knowledge needed for effective emotion recognition may come from multiple physical properties of speech, such as energy spectrum, frequency, and prosody, which can be collected as multi-view representations. However, existing works fail to fully explore the interactive relations among these multiple speech representations. In this paper, we propose a novel Collective Multi-view Relation Network (CMRN) that exploits the intrinsic characteristics of multi-view speech representations for discriminative speech emotion recognition. The proposed CMRN consists of three sub-networks: a view-specific attention network, a multi-view shared attention network, and a collective relation network. The view-specific attention network extracts the distinguishable view-specific features derived from the original speech. By contrast, the multi-view shared attention network captures the collaborative knowledge shared across views. The collective relation network is explicitly constructed to characterize the shared-specific correlations, which could reflect the underlying physical interactions. As a result, the decision phase can leverage both the shared and the view-specific information of the multiple representations, so that the final decision rule aggregates the heterogeneous information of the multi-view features for accurate emotion recognition. Extensive experiments on two benchmark datasets demonstrate the strong performance of the proposed method in comparison with several state-of-the-art methods.
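
The three-part structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the attention form, the relation function, or the fusion step, so everything below is an assumption. In particular, softmax dot-product attention pooling, cosine similarity as the shared-specific relation, concatenation as the fusion, and the names `attention_pool`, `cmrn_like_fusion`, `specific_queries`, and `shared_query` are hypothetical choices made only for illustration. The query vectors are random here, whereas in a real model they would be learned.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(frames, query):
    """Weight each frame by softmax(frame . query) and return the weighted sum."""
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def cmrn_like_fusion(views, specific_queries, shared_query):
    """Assumed fusion: per view, concatenate specific, shared, and relation."""
    fused = []
    for name, frames in views.items():
        # View-specific branch: each view has its own query (assumption).
        specific = attention_pool(frames, specific_queries[name])
        # Shared branch: one query is reused across all views (assumption).
        shared = attention_pool(frames, shared_query)
        # Relation branch: cosine similarity stands in for the
        # shared-specific correlation (assumption).
        relation = cosine(specific, shared)
        fused.extend(specific + shared + [relation])
    return fused

# Toy data: two hypothetical views, 6 frames each, 4 features per frame.
random.seed(0)
dim, n_frames = 4, 6
views = {
    name: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]
    for name in ("spectrum", "prosody")
}
specific_queries = {name: [random.gauss(0, 1) for _ in range(dim)] for name in views}
shared_query = [random.gauss(0, 1) for _ in range(dim)]

fused = cmrn_like_fusion(views, specific_queries, shared_query)
print(len(fused))  # 2 views * (4 specific + 4 shared + 1 relation) = 18
```

In this sketch the fused vector would be passed to a classifier for the final emotion decision. Sharing a single query across views is one simple way to model knowledge common to all views, while the per-view queries keep the view-specific information separate.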

Multi-dimensional Speaker Information Recognition with Multi-task Neural Network

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Identity, Gender, Age, and Emotion Recognition from Speaker Voice with Multi-task Deep Networks for Cognitive Robotics

Multi-task Learning for Text-Dependent Speaker Verification

MMTrans-MT: A Framework for Multimodal Emotion Recognition Using Multitask Learning

Hybrid Multi-Task Learning for End-To-End Multimodal Emotion Recognition

Speech Emotion Recognition Via Attention-based DNN from Multi-Task Learning

Speaker Personality Recognition with Multimodal Explicit Many2many Interactions

Multi-View Speech Emotion Recognition Via Collective Relation Construction

Multi-head attention-based long short-term memory model for speech emotion recognition

Multimodal Speech Emotion Recognition Based on Multi-Scale MFCCs and Multi-View Attention Mechanism

A Multi-Task Learning Framework for Emotion Recognition Using 2D Continuous Space

Multi-modal Correlated Network for Emotion Recognition in Speech

EEG emotion recognition via Identity based Multi-gate Mixture-of-Experts network

Multimodal emotion recognition based on deep neural network

Multimodal Emotion Recognition Based on Multilevel Acoustic and Textual Information

A Lightweight Multi-modal Emotion Recognition Network Based on Multi-task Learning

Emotion recognition using support vector machine and deep neural network

I-Vector Based Speaker Gender Recognition

Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition