Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence.

Zhen-Tao Liu, Abdul Rehman, Min Wu, Wei-Hua Cao, Man Hao
DOI: https://doi.org/10.1016/j.ins.2021.02.016
IF: 8.1
2021-01-01
Information Sciences
Abstract: Speech Emotion Recognition (SER) has numerous applications including human-robot interaction, online gaming, and health care assistance. While deep learning-based approaches achieve considerable precision, they often come with high computational and time costs. Indeed, feature learning strategies must search for important features in a large amount of speech data. In order to reduce these time and computational costs, we propose a pre-processing step in which speech segments with similar formant characteristics are clustered together and labeled as the same phoneme. The phoneme occurrence rates in emotional utterances are then used as the input features for classifiers. Using six databases (EmoDB, RAVDESS, IEMOCAP, ShEMO, DEMoS and MSP-Improv) for evaluation, the level of accuracy is comparable to that of current state-of-the-art methods, and the required training time is significantly reduced, from hours to minutes. (c) 2021 Elsevier Inc. All rights reserved.
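The pre-processing idea in the abstract (group segments by formant similarity, treat each group as a phoneme, then count how often each phoneme occurs in an utterance) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means is swapped in here as an assumption, and the (F1, F2) formant values, the cluster count, and the function names `assign`, `fit_centroids`, and `occurrence_rates` are all hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def occurrence_rates(segments, centroids):
    """Fraction of an utterance's segments that fall in each cluster."""
    labels = assign(segments, centroids)
    counts = np.bincount(labels, minlength=len(centroids))
    return counts / counts.sum()

rng = np.random.default_rng(42)
# Synthetic per-segment (F1, F2) estimates in Hz; values are illustrative only.
train_segments = np.vstack([
    rng.normal([300, 2300], 40, size=(50, 2)),
    rng.normal([700, 1200], 40, size=(50, 2)),
    rng.normal([450, 900], 40, size=(50, 2)),
])
centroids = fit_centroids(train_segments, k=3)

# One hypothetical utterance of 12 segments becomes a 3-dimensional feature vector.
utterance = rng.normal([700, 1200], 40, size=(12, 2))
features = occurrence_rates(utterance, centroids)
print(features)
```

The resulting vector has one entry per cluster and sums to 1, so it has a fixed length regardless of utterance duration and could be passed to any standard classifier.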