Hierarchical speech emotion recognition using the valence-arousal model

Arijul Haque, K. Sreenivasa Rao
DOI: https://doi.org/10.1007/s11042-024-19590-1
IF: 2.577
2024-06-15
Multimedia Tools and Applications
Abstract: Employing a hierarchical framework of emotions is an oft-explored approach to speech emotion recognition (SER). Existing works have so far defined the hierarchies arbitrarily. In this work, instead of using an arbitrary hierarchical framework, we attempt to derive the hierarchy of emotions from the valence-arousal (VA) model, which provides a natural categorization of emotions based on divisions in the VA plane. Using the German EMO-DB dataset, we designed two hierarchical classifiers, HS1 and HS2, with features extracted directly from the global power spectra along with some prosodic features. We obtained accuracies of 72.9% and 72.6% for HS1 and HS2 respectively, which outperformed the single-stage classifiers we developed by 5.8% and 5.5% respectively. This shows that the VA model is promising for developing hierarchical SER systems. The proposed methods are competitive with existing conventional machine learning approaches, whose reported accuracies range from about 50% up to 75-79%, and surpass many of them; deep learning approaches, however, still hold an advantage in most cases.
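The general idea of a hierarchical classifier built on the VA plane can be sketched as follows. This is a minimal illustration, not the authors' HS1 or HS2: the abstract does not describe their stage layout, so the split on arousal first, the four-emotion subset, the `AROUSAL_GROUP` mapping, the two toy features (energy, pitch), and the nearest-centroid classifier are all assumptions made for the example.

```python
from collections import defaultdict
from math import dist

# Hypothetical arousal grouping for illustration only (not taken from the paper).
AROUSAL_GROUP = {
    "anger": "high", "happiness": "high",
    "sadness": "low", "boredom": "low",
}


class NearestCentroid:
    """Tiny stand-in classifier: predicts the label of the closest class mean."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for features, label in zip(X, y):
            grouped[label].append(features)
        # One centroid per label: the column-wise mean of its feature vectors.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict_one(self, features):
        return min(self.centroids,
                   key=lambda label: dist(features, self.centroids[label]))


class HierarchicalSER:
    """Stage 1 picks an arousal group; stage 2 picks an emotion within it."""

    def fit(self, X, y):
        groups = [AROUSAL_GROUP[label] for label in y]
        self.stage1 = NearestCentroid().fit(X, groups)
        self.stage2 = {}
        for group in set(groups):
            idx = [i for i, g in enumerate(groups) if g == group]
            self.stage2[group] = NearestCentroid().fit(
                [X[i] for i in idx], [y[i] for i in idx])
        return self

    def predict_one(self, features):
        group = self.stage1.predict_one(features)
        return self.stage2[group].predict_one(features)


# Toy feature vectors [energy, pitch]; made-up values, not EMO-DB data.
X = [[0.90, 0.80], [0.95, 0.85], [0.80, 0.95], [0.85, 1.00],
     [0.20, 0.20], [0.15, 0.25], [0.30, 0.10], [0.35, 0.05]]
y = ["anger", "anger", "happiness", "happiness",
     "sadness", "sadness", "boredom", "boredom"]

model = HierarchicalSER().fit(X, y)
print(model.predict_one([0.92, 0.82]))
```

A single-stage baseline in the same setting would be `NearestCentroid().fit(X, y)` applied directly to the emotion labels; the hierarchical variant differs only in routing each sample through the coarse arousal decision first.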
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering