SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Hazim Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh
2024-07-22
Abstract: Speech Emotion Recognition (SER) has been traditionally formulated as a classification task. However, emotions are generally a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. The formulation breaks SER into predicting acoustic model features weighted by language model prediction. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over the state-of-the-art baselines, with 17% and 7% relative accuracy gains for RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance by Few-Shot Learning using a few annotated examples. The results highlight the effectiveness of our SER formulation, especially to improve performance in OOD scenarios.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is **improving the performance of Speech Emotion Recognition (SER) in Out-of-Domain (OOD) scenarios**. Traditional methods treat emotion recognition as a classification problem, which performs poorly on OOD data because emotion is a continuous spectrum whose distribution varies with the situation. As a result, when the test data differs from the training data in speakers, languages, accents, recording conditions, or emotion categories, model performance declines significantly.

To address this, the authors propose a new SER framework, **SELM (Speech Emotion Language Model)**. Instead of classifying, SELM infers emotion by generating the most likely sequence of emotion-related text tokens. The framework draws on the statistical formulation of Automatic Speech Recognition (ASR) and decomposes the SER task into predicting acoustic model features weighted by a language model prediction.

SELM was tested on three OOD datasets not used in training (RAVDESS, CREMA-D, IEMOCAP). It achieved relative accuracy gains of 17% on RAVDESS and 7% on CREMA-D over state-of-the-art baselines. SELM can also improve its OOD performance further through Few-Shot Learning with a few annotated examples.

### Main contributions:

1. **Proposed a new formulation of the SER task**: The task is decomposed into an acoustic model and a language model.
2. **Introduced the SELM model**: The model provides state-of-the-art performance in OOD scenarios.
3. **Proposed a Few-Shot Learning method for SELM**.
4. **Conducted extensive tests on multiple datasets**: These cover In-Domain, OOD, and Few-Shot Learning settings, establishing a benchmark for future work.

Through these improvements, SELM can recognize speech emotions effectively in a wider range of scenarios, especially when it encounters unseen data distributions.
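
The ASR-style decomposition mentioned above can be illustrated with a toy example. In the standard statistical formulation of ASR, the recognizer picks the text $W$ that maximizes $P(X \mid W)\,P(W)$ for audio $X$, i.e. an acoustic score combined with a language model prior. The sketch below applies the same scoring rule to a handful of candidate emotion labels. It is only an illustration of the general idea, not SELM's actual architecture or training procedure: the function name `rank_emotions`, the `lm_weight` parameter, and all probabilities are made up for this example.

```python
import math


def rank_emotions(acoustic_logp, lm_logp, lm_weight=1.0):
    """Rank candidate labels by acoustic log-score plus a weighted LM log-prior.

    Hypothetical helper: both dictionaries map a candidate emotion label to a
    log-probability, and lm_weight scales the language-model term.
    """
    scores = {
        label: acoustic_logp[label] + lm_weight * lm_logp[label]
        for label in acoustic_logp
    }
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only (not taken from the paper).
acoustic_logp = {
    "angry": math.log(0.5),
    "happy": math.log(0.3),
    "sad": math.log(0.2),
}
lm_logp = {
    "angry": math.log(0.2),
    "happy": math.log(0.5),
    "sad": math.log(0.3),
}

ranking = rank_emotions(acoustic_logp, lm_logp)
best_label = ranking[0][0]

# With the language-model term switched off, only the acoustic score counts.
acoustic_only_best = rank_emotions(acoustic_logp, lm_logp, lm_weight=0.0)[0][0]
```

With these toy values, the acoustic score alone favors "angry", while the combined score favors "happy", which shows how the language-model term can shift the final decision.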