Multi-View Speech Emotion Recognition Via Collective Relation Construction

Mixiao Hou, Zheng Zhang, Qi Cao, David Zhang, Guangming Lu
DOI: https://doi.org/10.1109/TASLP.2021.3133196
2022-01-08
Abstract: Automatic emotion recognition from speech plays a fundamental role in advancing emotional intelligence in human-machine interaction systems. The discriminative knowledge needed for effective emotion recognition may come from multiple physical properties of speech, such as energy spectrum, frequency, and prosody, which can be collected as multi-view representations. However, existing work fails to fully explore the underlying interactive relations among multiple speech representations for emotion recognition. In this paper, we propose a novel Collective Multi-view Relation Network (CMRN) to exploit the intrinsic characteristics of multi-view speech representations for discriminative speech emotion recognition. The proposed CMRN consists of three sub-networks: a view-specific attention network, a multi-view shared attention network, and a collective relation network. The view-specific attention network is designed to extract distinguishable view-specific features from the original speech. By contrast, the multi-view shared attention network is designed to capture the collaborative knowledge shared across multiple views. In addition, a collective relation network is explicitly constructed to characterize the shared-specific correlations, which can reflect the underlying physical interactions. The decision phase can therefore leverage both the shared and the view-specific information of the multiple representations, so that the final decision aggregates the heterogeneous information of the multi-view features to recognize emotion accurately. Extensive experiments on two benchmark datasets demonstrate the strong performance of the proposed method in comparison with some state-of-the-art methods.
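The three-part structure described in the abstract can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify layer types, dimensions, or how relations are computed, so the dot-product attention pooling, the element-wise product used as a "relation", the concatenation-based decision step, and every name below (`attention_pool`, `cmrn_like_forward`, the view labels, the random weights) are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(frames, query):
    """Weighted average of frame vectors; weights are softmax(frame . query)."""
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def cmrn_like_forward(views, specific_queries, shared_query, class_weights):
    """Toy forward pass loosely mirroring the three sub-networks (assumed design)."""
    names = list(views)
    # 1) View-specific attention: each view is pooled with its own query.
    specific = {n: attention_pool(views[n], specific_queries[n]) for n in names}
    # 2) Shared attention: one query is reused for every view, then averaged.
    pooled = [attention_pool(views[n], shared_query) for n in names]
    dim = len(shared_query)
    shared = [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]
    # 3) Relation (assumption): element-wise product of shared and specific features.
    relations = [[s * v for s, v in zip(shared, specific[n])] for n in names]
    # 4) Decision: concatenate everything and apply a linear classifier.
    fused = list(shared)
    for n in names:
        fused.extend(specific[n])
    for r in relations:
        fused.extend(r)
    return softmax([dot(w, fused) for w in class_weights])


# Usage with random stand-in features (not real speech data).
random.seed(0)
DIM, FRAMES, CLASSES = 4, 6, 3
VIEW_NAMES = ["energy_spectrum", "frequency", "prosody"]


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


views = {n: [rand_vec(DIM) for _ in range(FRAMES)] for n in VIEW_NAMES}
specific_queries = {n: rand_vec(DIM) for n in VIEW_NAMES}
shared_query = rand_vec(DIM)
fused_dim = DIM * (1 + 2 * len(VIEW_NAMES))  # shared + specifics + relations
class_weights = [rand_vec(fused_dim) for _ in range(CLASSES)]

probs = cmrn_like_forward(views, specific_queries, shared_query, class_weights)
print(len(probs), round(sum(probs), 6))
```

The sketch keeps the shared query separate from the per-view queries so that the shared and view-specific paths stay distinct, and only combines them in the relation and decision steps. A trained model would learn these parameters rather than sample them at random.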