Adapting WavLM for Speech Emotion Recognition

Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin
2024-05-08
Abstract: Recently, the usage of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.
Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper explores how to fine-tune the WavLM model for Speech Emotion Recognition (SER) and investigates three research questions:

1. **Does pooling over the time dimension affect the quality of speech emotion recognition?** The researchers experimented with standard deviation (STD) pooling and attention pooling and compared them with conventional average pooling.
2. **Does incorporating speaker gender information enhance emotion classification performance?** The paper proposes integrating gender information into the model output through dot-product multiplication.
3. **Does utilizing the textual information corresponding to speech segments aid emotion classification?** The paper explores combining textual information with the pooled WavLM output.

To answer these questions, the research team conducted a series of experiments on the MSP Podcast Corpus, specifically:

- Processing WavLM outputs with different pooling strategies, such as STD pooling and attention pooling;
- Conditioning the model output on gender information;
- Combining textual information with the pooled output to try to improve performance further.

The researchers found that:

- **STD pooling in the time dimension** can improve the overall performance of the model;
- **Incorporating gender information** helps to enhance the accuracy of emotion classification;
- **Adding textual information** did not significantly improve the model's performance and, in some cases, slightly decreased it.

Additionally, to improve the generalization of the predictions, the researchers adopted a model fusion approach that combines the predictions of multiple models. In the Odyssey 2024 challenge, fusing five different model configurations achieved an F1-macro score of 0.35, an improvement over the single best model.
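
The pooling and fusion steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_pool`, `std_pool`, `fuse_predictions`) and the toy numbers are hypothetical, STD pooling is assumed to mean the per-feature standard deviation over time, and fusion is assumed to be a plain average of class probabilities, which the summary does not specify. Plain Python lists stand in for the tensors a real pipeline would use.

```python
from statistics import fmean, pstdev


def mean_pool(frames):
    """Average each feature over the time axis; `frames` is a T x D list of lists."""
    return [fmean(column) for column in zip(*frames)]


def std_pool(frames):
    """Population standard deviation of each feature over the time axis.

    Assumption: "STD pooling" is taken here as the per-feature standard
    deviation alone, not concatenated with the mean.
    """
    return [pstdev(column) for column in zip(*frames)]


def fuse_predictions(per_model_probs):
    """Average class probabilities across models and pick the top class.

    Assumption: fusion is a simple unweighted average; the source only says
    that predictions of multiple models were combined.
    """
    fused = [fmean(column) for column in zip(*per_model_probs)]
    best_class = max(range(len(fused)), key=fused.__getitem__)
    return fused, best_class


# Toy frame-level features: T = 3 time steps, D = 2 features (made-up numbers).
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]
utterance_mean = mean_pool(frames)
utterance_std = std_pool(frames)

# Toy class probabilities from three hypothetical models over three classes.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
fused, predicted_class = fuse_predictions(probs)
```

Both pooling functions map a variable-length sequence of frame features to one fixed-size vector per utterance, which is what allows a classifier head to sit on top of the encoder output. Mean pooling keeps the average level of each feature, while STD pooling keeps how much each feature varies over the utterance.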