SeeSpeech - See Emotions in The Speech.

Jianing Geng, Hao Zhu, Xiang-Yang Li
DOI: https://doi.org/10.1145/3456126.3456129
2021-01-01
Abstract: Machine understanding of speech currently focuses mostly on semantics, but speech also carries emotion. Emotion can not only reinforce the semantic content of an utterance but can even change it. This paper presents SeeSpeech, a method for speech emotion classification. SeeSpeech uses MCEP as the speech emotion feature and feeds it separately into a CNN and a Transformer. To obtain richer features, the CNN uses batch normalization while the Transformer uses layer normalization, and the outputs of the two networks are then combined. Finally, the emotion class is obtained through SoftMax. SeeSpeech achieves a highest classification accuracy of 97% on the RAVDESS dataset and 85% in a test on an actual edge gateway. These results indicate that SeeSpeech performs well at speech emotion classification and has broad application prospects in human-computer interaction.
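The normalize, combine, and classify steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The CNN and Transformer themselves are omitted, and their outputs are replaced by small placeholder feature lists. Applying the normalization to the branch outputs (rather than inside each network), concatenation as the way of combining, and a single linear layer before SoftMax are all assumptions made for illustration. All function names here are hypothetical.

```python
import math


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) using statistics taken across the batch."""
    n = len(x)
    cols = list(zip(*x))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in x
    ]


def layer_norm(x, eps=1e-5):
    """Normalize each sample (row) using statistics taken across its own features."""
    out = []
    for row in x:
        m = sum(row) / len(row)
        s = sum((v - m) ** 2 for v in row) / len(row)
        out.append([(v - m) / math.sqrt(s + eps) for v in row])
    return out


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def fuse_and_classify(cnn_feats, transformer_feats, weights, bias):
    """Assumed fusion: normalize each branch, concatenate per sample, then softmax."""
    cnn_n = batch_norm(cnn_feats)           # CNN branch: batch normalization
    tr_n = layer_norm(transformer_feats)    # Transformer branch: layer normalization
    fused = [c + t for c, t in zip(cnn_n, tr_n)]  # per-sample list concatenation
    return [softmax(linear(f, weights, bias)) for f in fused]
```

Calling `fuse_and_classify` with two samples, two placeholder features per branch, and a weight matrix with one row per emotion class returns one probability distribution per sample. The contrast worth noting is that `batch_norm` depends on the other samples in the batch, while `layer_norm` treats each sample independently.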