Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

Juan Li, Xueying Zhang, Fenglian Li, Lixia Huang
DOI: https://doi.org/10.1016/j.ins.2023.119649
IF: 8.1
2023-09-10
Information Sciences
Abstract: Speech emotion recognition (SER) is an essential field of artificial intelligence. Although the Mel spectrogram is commonly used in SER, it emphasizes low-frequency emotional components. In this paper, we propose the VMD-Teager-Mel (VTMel) spectrogram, which complements the Mel spectrogram by emphasizing high-frequency components. In addition, to reduce the redundancy of the acoustic features, we propose a convolutional neural network with a deep restricted Boltzmann machine (CNN-DBM) to obtain optimized deep features. Furthermore, a dual-channel complementary structure is proposed for SER. First, one CNN-DBM extracts optimized deep features from the Mel spectrogram, highlighting low-frequency components. Second, another CNN-DBM extracts optimized deep features from the VTMel spectrogram, highlighting high-frequency components. The two sets of features are then spliced together and fed to a classifier. Experimental results on three public datasets (EMO-DB, SAVEE, and RAVDESS) show that the merged features achieve better performance, confirming the complementarity between the Mel and VTMel spectrograms. Recognition accuracy with CNN-DBM optimized deep features is higher than with deep features from a CNN alone, which supports the use of the DBM stage. Our experiments also show advantages of the proposed method over state-of-the-art methods reported in the literature.
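The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions, under stated assumptions. It assumes that "Teager" in VMD-Teager-Mel refers to the discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1]*x[n+1], and that "spliced together" means concatenating the two channels' feature vectors. The function names, the feature dimension of 128, and the random stand-in features are hypothetical and are not taken from the paper. The VMD step, the Mel filtering, and the CNN-DBM networks are not reproduced here.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager-Kaiser energy operator.

    Returns psi[n] = x[n]^2 - x[n-1] * x[n+1] for the interior samples,
    so the output is two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def splice_features(mel_feats, vtmel_feats):
    """Concatenate the two channels' features along the last axis.

    Assumption: both inputs have shape (n_utterances, n_features).
    """
    return np.concatenate([mel_feats, vtmel_feats], axis=-1)


# Hypothetical stand-ins for the deep features of each channel;
# in the paper these would come from the two CNN-DBM networks.
rng = np.random.default_rng(0)
mel_feats = rng.standard_normal((4, 128))
vtmel_feats = rng.standard_normal((4, 128))

# Fused representation that would be passed to a classifier.
fused = splice_features(mel_feats, vtmel_feats)

# For a pure tone cos(w * n), the operator output is the constant sin(w)^2,
# which grows with frequency for small w.
tone_energy = teager_energy(np.cos(0.3 * np.arange(50)))
```

Because the operator's response to a tone grows with frequency, it is a plausible way to emphasize high-frequency content, which matches the role the abstract assigns to the VTMel spectrogram.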
computer science, information systems