Learning Deep Multimodal Affective Features for Spontaneous Speech Emotion Recognition.

Shiqing Zhang, Xin Tao, Yuelong Chuang, Xiaoming Zhao
DOI: https://doi.org/10.1016/j.specom.2020.12.009
IF: 2.723
2021-01-01
Speech Communication
Abstract: Recently, spontaneous speech emotion recognition has become an active and challenging research subject. This paper proposes a new method for spontaneous speech emotion recognition that uses deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs). The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three aspects: 1) a 1D CNN for 1D raw waveform modeling, 2) a 2D CNN for 2D time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Then, average pooling is performed on the segment-level classification results obtained from the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted as the multi-CNN fusion method to integrate the different utterance-level classification results for final emotion classification. The learned deep multimodal audio features are shown to be complementary to each other, so combining them in a multi-CNN fusion network achieves significantly improved emotion classification performance. Experiments are conducted on two challenging spontaneous emotional speech datasets, i.e., the AFEW 5.0 and BAUM-1s databases, demonstrating the promising performance of the proposed method.
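The last two steps of the abstract (average pooling of segment-level results, then score-level fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the model keys `cnn1d`/`cnn2d`/`cnn3d`, the toy scores, and the default of equal fusion weights are all assumptions made for illustration, since the abstract does not state how the fusion weights are chosen.

```python
from typing import Dict, List, Optional, Sequence


def utterance_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class scores into one utterance-level score vector."""
    if not segment_scores:
        raise ValueError("need at least one segment")
    n_segments = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [
        sum(seg[c] for seg in segment_scores) / n_segments
        for c in range(n_classes)
    ]


def fuse_scores(
    per_model: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Score-level fusion as a weighted sum of utterance-level scores.

    Equal weights are an assumption used when none are supplied.
    """
    names = list(per_model)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    n_classes = len(per_model[names[0]])
    return [
        sum(weights[name] * per_model[name][c] for name in names)
        for c in range(n_classes)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Toy two-class example with made-up segment-level scores per network.
segments = {
    "cnn1d": [[0.6, 0.4], [0.2, 0.8]],
    "cnn2d": [[0.7, 0.3], [0.7, 0.3]],
    "cnn3d": [[0.1, 0.9], [0.3, 0.7]],
}
per_model = {name: utterance_scores(s) for name, s in segments.items()}
fused = fuse_scores(per_model)
label = predict(fused)
```

Passing an explicit `weights` dictionary to `fuse_scores` lets one network count for more than the others, which is one common way score-level fusion is tuned on a validation set.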