Abstract:Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieve consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods which can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised learning pre-training to the classification of emotions from the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU- MOSEI)'s acoustic data. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data with 74 parameters of distinctive audio features at discrete timesteps. Our model is first pre-trained to uncover the randomly masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated via overall mean absolute error (MAE), mean absolute error (MAE) per emotion, overall four-class accuracy, and four-class accuracy per emotion. These metrics are compared against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics, especially when the number of annotated data points in the fine-tuning step is small. Furthermore, we quantify the behaviors of the self-supervised model and its convergence as the amount of annotated data increases. This work characterizes the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small and that the effect is most pronounced for emotions which are easier to classify such as happy, sad, and angry. This work further demonstrates that self-supervised learning still improves performance when applied to the embedded feature representations rather than the traditional approach of pre-training on the raw input space.

Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge

Self-attention Transfer Networks for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based On Cnn-Blstm And Multiple Svms

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Towards Discriminative Representation Learning for Speech Emotion Recognition

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Towards Discriminative Representations and Unbiased Predictions: Class-Specific Angular Softmax for Speech Emotion Recognition.

Self-Supervised EEG Representation Learning for Robust Emotion Recognition

Learning multi-scale features for speech emotion recognition with connection attention mechanism

Self-supervised utterance order prediction for emotion recognition in conversations

Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation

CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition

EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition

Audio-Based Emotion Recognition Using Self-Supervised Learning on an Engineered Feature Space

Emotion embedding framework with emotional self-attention mechanism for speaker recognition

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition