Abstract: Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performing models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods that can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised pre-training to the classification of emotions from the acoustic modality of CMU-MOSEI. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data. Our model is first pre-trained to uncover randomly masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated with several metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics. This work shows the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small, and that the effect is most pronounced for emotions that are easier to classify, such as happiness, sadness, and anger. This work further demonstrates that self-supervised learning works when applied to embedded feature representations rather than the traditional approach of pre-training on the raw input space.

What problem does this paper attempt to address?

The paper explores the application and effectiveness of self-supervised learning (SSL) in audio-based emotion recognition. Specifically, the researchers investigated the following issues:

1. **Background and Challenges**: Emotion recognition models have a wide range of applications in fields such as mental healthcare, marketing, gaming, and social media analysis. However, one of the biggest obstacles to training high-performance emotion recognition models is the scarcity of available labeled data.
2. **Solution**: To address this problem, the research team pre-trained the model with a self-supervised objective, which can improve performance when labeled data is scarce. They applied this method to the audio modality of the CMU-Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset.
3. **Technical Details**:
   - Encoded audio data (74 parameters per time step) was used instead of raw audio data.
   - The model was first pre-trained by predicting randomly masked timestamps to learn the internal representation of audio features.
   - After pre-training, the model was fine-tuned using a small amount of labeled data to complete the emotion classification task.
4. **Experimental Results**:
   - Self-supervised learning consistently improved the model's performance on all evaluation metrics, especially when less labeled data was available.
   - For some easier-to-recognize emotions (such as happiness, sadness, and anger), the effect of self-supervised learning was more pronounced.
   - For more difficult-to-recognize emotions (such as surprise and fear), the improvement from self-supervised learning was smaller.
5. **Conclusion and Outlook**:
   - The study confirmed the effectiveness of self-supervised learning in emotion recognition tasks, particularly in situations with scarce labeled data.
   - The paper pointed out some potential research directions, including trying different modality combinations and exploring other deep neural network architectures.

In summary, this research shows that self-supervised pre-training on encoded audio features is an effective approach to audio-based emotion recognition, with the largest benefit when labeled data is scarce.
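The masked pre-training step described above can be illustrated with a small sketch. It is not the authors' implementation: the zero-fill masking, the 15% mask ratio, the mean-squared-error objective, and the function names `mask_timesteps` and `masked_mse` are all assumptions made for illustration. Only the 74 features per time step come from the summary.

```python
import random

FEATURE_DIM = 74  # encoded acoustic parameters per time step (from the summary)


def mask_timesteps(sequence, mask_ratio=0.15, rng=None):
    """Zero out a random subset of time steps.

    Returns the masked copy and the sorted indices that were hidden.
    mask_ratio and zero-filling are assumed choices, not values from the paper.
    """
    rng = rng or random.Random()
    n_steps = len(sequence)
    n_masked = max(1, round(mask_ratio * n_steps))
    masked_idx = sorted(rng.sample(range(n_steps), n_masked))
    hidden = set(masked_idx)
    masked = [
        [0.0] * len(frame) if t in hidden else list(frame)
        for t, frame in enumerate(sequence)
    ]
    return masked, masked_idx


def masked_mse(prediction, target, masked_idx):
    """Mean squared error computed only over the hidden time steps."""
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(prediction[t], target[t]):
            total += (p - y) ** 2
            count += 1
    return total / count


# Toy sequence: 20 time steps of 74 encoded features each.
rng = random.Random(0)
sequence = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(20)]
masked, masked_idx = mask_timesteps(sequence, mask_ratio=0.15, rng=rng)

# Stand-in "model" that returns its masked input unchanged; a real encoder
# would be trained to drive this loss toward zero.
loss = masked_mse(masked, sequence, masked_idx)
```

Restricting the loss to the hidden time steps means the model gets no credit for copying visible frames, so it has to infer the missing frames from their context. In a fine-tuning stage, the same encoder would be reused with a classification head trained on the small labeled subset.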

Self-Supervised Learning for Audio-Based Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Semi-Supervised Self-Learning Enhanced Music Emotion Recognition

Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models

Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition

Self-Supervised Learning for ECG-Based Emotion Recognition

Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

Masked self‐supervised pre‐training model for EEG‐based emotion recognition

Self-Supervised EEG Representation Learning for Robust Emotion Recognition

Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge

CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

Speech Emotion Recognition Using Attention Model

Few-shot Learning in Emotion Recognition of Spontaneous Speech Using a Siamese Neural Network with Adaptive Sample Pair Formation

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

Self-supervised learning for infant cry analysis

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition