Self-Supervised Learning for Audio-Based Emotion Recognition

Peranut Nimitsurachat, Peter Washington
2023-07-23
Abstract: Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods which can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised pre-training to the classification of emotions from the acoustic modality of CMU-MOSEI. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data. Our model is first pre-trained to uncover the randomly masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated via several metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics. This work shows the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small, and that the effect is most pronounced for emotions which are easier to classify, such as happiness, sadness, and anger. This work further demonstrates that self-supervised learning works when applied to embedded feature representations rather than the traditional approach of pre-training on the raw input space.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper explores the application and effectiveness of Self-Supervised Learning (SSL) in audio-based emotion recognition. Specifically, the researchers investigated the following issues:

1. **Background and Challenges**: Emotion recognition models have a wide range of applications in fields such as mental health care, marketing, gaming, and social media analysis. However, one of the biggest obstacles to training high-performance emotion recognition models is the scarcity of available labeled data.
2. **Solution**: To address this problem, the research team pre-trained the model with a self-supervised learning approach, which can improve model performance when labeled data is scarce. They applied this method to the audio modality of the CMU-Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset.
3. **Technical Details**:
   - Encoded audio data (74 parameters per time step) was used instead of raw audio data.
   - The model was first pre-trained by predicting randomly masked timestamps to learn the internal representation of audio features.
   - After pre-training, the model was fine-tuned using a small amount of labeled data to complete the emotion classification task.
4. **Experimental Results**:
   - Self-supervised learning consistently improved the model's performance on all evaluation metrics, especially in cases with less labeled data.
   - For some easier-to-recognize emotions (such as happiness, sadness, and anger), the effect of self-supervised learning was more pronounced.
   - For more difficult-to-recognize emotions (such as surprise and fear), the improvement from self-supervised learning was smaller.
5. **Conclusion and Outlook**:
   - The study confirmed the effectiveness of self-supervised learning in emotion recognition tasks, particularly in situations with scarce labeled data.
   - The paper pointed out some potential research directions, including trying different modality combinations and exploring other deep neural network architectures.

In summary, this research shows that self-supervised pre-training on encoded acoustic features is an effective approach to audio-based emotion recognition, and that it helps most in data-scarce settings.
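The masked-timestep pretext task described under Technical Details can be illustrated with a short sketch. It is not the authors' implementation: the function names, the 15% mask ratio, zero-filling of masked frames, and the mean-squared-error objective are all assumptions chosen for illustration. Only the 74 features per time step comes from the summary above.

```python
import random

FEATURE_DIM = 74  # acoustic parameters per time step, as stated in the summary


def mask_timesteps(sequence, mask_ratio=0.15, rng=None):
    """Zero out a random subset of time steps in a clip.

    Returns (masked_sequence, masked_indices). The mask ratio and the
    zero-fill strategy are illustrative assumptions, not paper values.
    """
    rng = rng or random.Random()
    num_steps = len(sequence)
    # Mask at least one step, but never more steps than the clip contains.
    num_masked = min(num_steps, max(1, round(mask_ratio * num_steps)))
    masked_indices = sorted(rng.sample(range(num_steps), num_masked))
    masked_set = set(masked_indices)
    masked_sequence = [
        [0.0] * len(frame) if t in masked_set else list(frame)
        for t, frame in enumerate(sequence)
    ]
    return masked_sequence, masked_indices


def masked_reconstruction_loss(prediction, target, masked_indices):
    """Mean squared error computed only over the masked time steps."""
    total, count = 0.0, 0
    for t in masked_indices:
        for predicted, actual in zip(prediction[t], target[t]):
            total += (predicted - actual) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical clip: 20 time steps of 74 encoded acoustic features each.
    clip = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(20)]
    masked_clip, masked_idx = mask_timesteps(clip, mask_ratio=0.15, rng=rng)
    # An identity "model" that echoes its masked input stands in for the encoder;
    # a real encoder would be trained to drive this loss down.
    loss = masked_reconstruction_loss(masked_clip, clip, masked_idx)
    print(f"masked steps: {masked_idx}, reconstruction loss: {loss:.3f}")
```

In a full pipeline, the encoder trained on this objective would then be reused with a classification head and fine-tuned on the small labeled subset. Restricting the loss to the masked positions keeps the model from being rewarded for simply copying the frames it can already see.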