Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Abdul Rehman, Zhen-Tao Liu, Min Wu, Wei-Hua Cao, Cheng-Shan Jiang
DOI: https://doi.org/10.1016/j.apacoust.2023.109444
IF: 3.614
2023-01-01
Applied Acoustics
Abstract: Speech emotion recognition systems have high computational requirements for deep learning models and low generalizability, mainly because of the poor reliability of emotional measurements across multiple corpora. To solve these problems, we present a speech emotion recognition system based on a reductionist approach of decomposing and analyzing syllable-level features. The Mel-spectrogram of an audio stream is decomposed into syllable-level components, which are then analyzed to extract statistical features. The proposed method uses formant attention, noise-gate filtering, and rolling normalization contexts to decrease contextual differences and focus attention on the structure of a formant. A set of syllable-level formant features is extracted and fed into a single-hidden-layer neural network that makes predictions for each syllable, as opposed to the conventional approach of using a sophisticated deep learner to make sentence-wide predictions. The syllable-level predictions help to lower the aggregated error in utterance-level cross-corpus predictions. The experiments on the IEMOCAP (IE), MSP-Improv (MI), and RAVDESS (RA) databases show that the method achieves better-than-state-of-the-art cross-corpus unweighted accuracy of 47.6% for IE to MI and 56.2% for MI to IE. © 2023 Elsevier Ltd. All rights reserved.
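As a rough illustration of two ideas mentioned in the abstract, the sketch below pools per-syllable class probabilities into one utterance-level label and computes unweighted accuracy as the mean of per-class recalls. The averaging rule and the function names `aggregate_utterance` and `unweighted_accuracy` are assumptions made for this example; the abstract does not state how the authors aggregate syllable predictions.

```python
from collections import defaultdict


def aggregate_utterance(syllable_probs):
    """Pool per-syllable class probabilities into one utterance label.

    Assumption: simple averaging over syllables followed by argmax.
    The paper's actual aggregation rule may differ.
    """
    n_classes = len(syllable_probs[0])
    n_syllables = len(syllable_probs)
    mean = [
        sum(p[c] for p in syllable_probs) / n_syllables
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: mean[c])


def unweighted_accuracy(y_true, y_pred):
    """Unweighted accuracy: the mean of per-class recalls."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical utterance with three syllables and two emotion classes.
label = aggregate_utterance([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
ua = unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

Because every class contributes equally to the mean, unweighted accuracy is less sensitive to class imbalance than plain accuracy, which is presumably why it is the metric reported for cross-corpus evaluation.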