Abstract: Recently, the use of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to the Speech Emotion Recognition Challenge 2024.

What problem does this paper attempt to address?

The paper primarily explores how to fine-tune the WavLM model for application in Speech Emotion Recognition (SER) tasks and investigates the following research questions:

1. **Does time-dimensional pooling affect the quality of speech emotion recognition?** The researchers experimented with different methods such as standard deviation (STD) pooling and attention pooling, and compared them with traditional average pooling.
2. **Does incorporating speaker gender information enhance emotion classification performance?** The paper proposes a method to integrate gender information into the model output through dot product multiplication.
3. **Does utilizing the textual information corresponding to speech segments aid in emotion classification?** The paper explores the possibility of combining textual information with the pooled WavLM output.

To answer these questions, the research team conducted a series of experiments on the MSP Podcast Corpus dataset, specifically including:

- Processing WavLM outputs using different pooling strategies, such as STD pooling and attention pooling;
- Conditioning the model output with gender information;
- Combining textual information to further improve model performance.

Ultimately, the researchers found that:

- **STD pooling in the time dimension** can improve the overall performance of the model;
- **Incorporating gender information** indeed helps to enhance the accuracy of emotion classification;
- **Adding textual information** did not significantly improve the model's performance and in some cases, even slightly decreased it.

Additionally, to improve the generalization ability of the prediction results, the researchers adopted a model fusion approach, combining the predictions of multiple models to enhance the overall emotion recognition accuracy. In the Odyssey 2024 challenge, by fusing five different model configurations, they achieved an F1-macro score of 0.35, which was an improvement over the single best model.
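The pooling, fusion, and scoring steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names are hypothetical, frame-level features are assumed to be a list of per-frame vectors, STD pooling is assumed to mean the per-dimension standard deviation over time, and fusion is assumed to be a plain average of class probabilities, since the summary does not specify the fusion rule.

```python
from statistics import fmean, pstdev


def mean_pool(frames):
    """Average each feature dimension over time (frames: T lists of D floats)."""
    return [fmean(column) for column in zip(*frames)]


def std_pool(frames):
    """Population standard deviation of each feature dimension over time."""
    return [pstdev(column) for column in zip(*frames)]


def fuse(prob_vectors):
    """Average the class-probability vectors of several models (assumed rule)."""
    return [fmean(column) for column in zip(*prob_vectors)]


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return fmean(scores)


# Toy input: 3 time frames, 2 feature dimensions (illustrative values only).
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
pooled_mean = mean_pool(frames)
pooled_std = std_pool(frames)

# Two hypothetical models voting on a 2-class problem.
fused = fuse([[0.6, 0.4], [0.2, 0.8]])
predicted_class = max(range(len(fused)), key=fused.__getitem__)

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Mean pooling discards how much a feature varies across the utterance, whereas the standard deviation keeps exactly that information, which is one plausible reason it could help for emotion. Macro averaging weights every emotion class equally regardless of how many samples it has.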

Adapting WavLM for Speech Emotion Recognition
