Syllable Level Speech Emotion Recognition Based on Formant Attention

Abdul Rehman, Zhen-Tao Liu, Jin-Meng Xu
DOI: https://doi.org/10.1007/978-3-030-93049-3_22
2021-01-01
Abstract: The performance of speech emotion recognition (SER) systems can be significantly compromised by the sentence structure of the words being spoken. Because the relation between the affective content and the lexical content of speech is difficult to determine from a small training sample, pattern recognition methods based on temporal sequences fail to generalize over different sentences in the wild. This paper proposes a method that recognizes emotion for each syllable separately instead of applying pattern recognition to a whole utterance. The work emphasizes the preprocessing of the received audio samples: the skeleton structure of the Mel-spectrum is extracted using a formant attention method, and utterances are then sliced into syllables based on the contextual changes in the formants. The proposed syllable onset detection and feature extraction method is validated on two databases for the accuracy of emotional class prediction. The suggested SER method achieves up to 67% and 55% unweighted accuracy on the IEMOCAP and MSP-Improv datasets, respectively. The experimental results demonstrate the effectiveness of the method, which is compared with state-of-the-art SER methods.
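The abstract reports results as unweighted accuracy but does not define the metric. In the SER literature the term usually means the average of the per-class recalls, so that every emotion class counts equally regardless of how many samples it has. The sketch below assumes that definition; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of unweighted accuracy)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain fraction of correctly classified samples, shown for contrast."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: an imbalanced set where the classifier always says "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
```

On this toy input the weighted accuracy is 0.8 while the unweighted accuracy is 0.5, which illustrates why the unweighted figure is the more conservative choice for imbalanced emotion corpora.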