TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition

Xin-Cheng Wen, Kun-Hong Liu, Yan Luo, Jiaxin Ye, Liyan Chen
DOI: https://doi.org/10.1007/s00500-023-08957-5
IF: 3.732
2023-08-17
Soft Computing
Abstract: Speech Emotion Recognition (SER) is a challenging task, and the typical convolutional neural network (CNN) cannot handle speech data well directly, because a CNN tends to capture local information and ignore the overall characteristics. This paper proposes a Capsule Network with Two-Way Attention Mechanism (TWACapsNet for short) for the SER problem. TWACapsNet accepts spatial and spectral features as inputs, and a convolutional layer and a capsule layer are deployed to process these two types of features separately, in two ways. After that, two attention mechanisms are designed to enhance the information obtained from the spatial and spectral features. Finally, the results of the two ways are combined to form the final decision. The advantage of TWACapsNet is verified by experiments on multiple SER data sets, and the experimental results show that the proposed method outperforms widely deployed neural network models on three typical SER data sets. Furthermore, the combination of the two ways contributes to the higher and more stable performance of TWACapsNet.
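The two-way structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the convolutional and capsule layers are omitted entirely, the attention is plain dot-product attention pooling, and the two ways are combined by averaging class probabilities. All of these choices, and every name below (attention_pool, branch_probs, fuse, query, class_weights), are assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_pool(frames, query):
    """Assumed attention: weight each frame by its similarity to a
    query vector, then return the weighted sum of the frames."""
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

def branch_probs(frames, query, class_weights):
    """One 'way': attention pooling followed by a linear classifier.
    Stands in for the conv/capsule processing, which is not modeled here."""
    pooled = attention_pool(frames, query)
    return softmax([dot(pooled, w) for w in class_weights])

def fuse(probs_a, probs_b):
    """Assumed combination rule: average the two ways' class probabilities."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]

# Toy inputs: two frames per way, two feature dimensions, two emotion classes.
spatial = [[1.0, 0.0], [1.0, 0.0]]
spectral = [[0.0, 1.0], [1.0, 0.0]]
query = [0.5, 0.5]
class_weights = [[1.0, 0.0], [0.0, 1.0]]

final = fuse(branch_probs(spatial, query, class_weights),
             branch_probs(spectral, query, class_weights))
predicted_class = max(range(len(final)), key=final.__getitem__)
```

In this toy setup each way yields its own class distribution, so either way can be evaluated alone or in combination, which mirrors the abstract's claim that combining the two ways is what stabilizes performance.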
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?