Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion

Yanlin Liu, Aibin Chen, Guoxiong Zhou, Jizheng Yi, Jin Xiang, Yaru Wang
DOI: https://doi.org/10.1007/s11042-023-17829-x
IF: 2.577
2024-01-03
Multimedia Tools and Applications
Abstract: In Speech Emotion Recognition (SER), a single feature cannot represent emotion well, and emotional features are difficult to extract. To address this, we propose a Feature-Level (FL) fusion method for 9 types of acoustic features and a combined CNN-LSTM network with attention (CNN-A-LSTM). The fused feature vector set, which contains prosodic and spectral features, is used as the input to the CNN-A-LSTM network. High-level Statistical Functions (HSFs) are also added to capture global features. The CNN-A-LSTM feature extraction network reads time-series input and extracts speech emotion information more effectively. Finally, Softmax is applied as the classifier to obtain the emotion classification results. The method is evaluated on the SAVEE and CASIA datasets. The experimental results show that it outperforms the state-of-the-art methods it is compared with, achieving accuracies of 94.5% on SAVEE and 96.7% on CASIA. These results demonstrate the effectiveness of the algorithm and indicate some generalization ability.
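The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not list the 9 acoustic features or the specific statistical functions, so the two toy feature streams (`pitch`, `spectral`), the chosen statistics (mean, standard deviation, minimum, maximum), and all function names below are assumptions made for illustration. The CNN-A-LSTM network itself is omitted; only per-frame concatenation, utterance-level statistics, and a Softmax over hypothetical class scores are shown.

```python
import math
from statistics import mean, pstdev


def fuse_frames(feature_streams):
    """Feature-level fusion: concatenate the per-frame vectors of every stream."""
    n_frames = len(feature_streams[0])
    if any(len(stream) != n_frames for stream in feature_streams):
        raise ValueError("all feature streams must have the same number of frames")
    return [
        [value for stream in feature_streams for value in stream[t]]
        for t in range(n_frames)
    ]


def statistical_functions(frames):
    """Utterance-level statistics per fused dimension (assumed set: mean, std, min, max)."""
    stats = []
    for column in zip(*frames):  # one column per fused feature dimension
        stats.extend([mean(column), pstdev(column), min(column), max(column)])
    return stats


def softmax(logits):
    """Numerically stable Softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input (hypothetical values): 3 frames, one prosodic and two spectral dimensions.
pitch = [[120.0], [125.0], [130.0]]
spectral = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

fused = fuse_frames([pitch, spectral])   # 3 frames x 3 dimensions, fed to the sequence model
hsf = statistical_functions(fused)       # 3 dimensions x 4 statistics, the global descriptor
probs = softmax([2.0, 1.0, 0.1])         # placeholder scores standing in for network output
```

In this sketch the fused frame sequence would be the time-series input to a sequence model, while the statistics vector summarizes the whole utterance; how the paper combines the two is not stated in the abstract.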
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering