Fast Yet Effective Speech Emotion Recognition with Self-distillation

Zhao Ren, Thanh Tam Nguyen, Yi Chang, Björn W. Schuller

DOI: https://doi.org/10.48550/arXiv.2210.14636

2022-10-26

Abstract: Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become a trustworthy human conversational partner. Due to the lengthy nature of speech, SER also suffers from the lack of abundant labelled data for powerful models like deep neural networks. Pre-trained complex models on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue achieving a fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pretrained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) the adoption of self-distillation method upon the acoustic modality breaks through the limited ground-truth of speech data, and outperforms the existing models' performance on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process rather than training from scratch for self-distillation leads to faster learning time and the state-of-the-art accuracy on data with small quantities of label information.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses several key challenges in speech emotion recognition (SER):

1. **Scarcity of annotated data**: Annotating speech data is costly and time-consuming, so little labelled data is available for training powerful models. This limits how well complex models such as deep neural networks perform on SER tasks.
2. **Trade-off between model efficiency and accuracy**: Large pre-trained models perform well on SER through transfer learning, but fine-tuning them requires a large amount of memory, and their inference is slow. Improving runtime efficiency while maintaining accuracy is therefore an important research direction.
3. **Adaptability on resource-constrained devices**: Resource-constrained edge devices need a way to execute a powerful model at different depths, so that accuracy can be traded against efficiency adaptively.

To address these challenges, the authors propose a framework based on self-distillation. The method simultaneously fine-tunes a pre-trained model and trains shallower versions of it. The shallower versions have fewer parameters and faster inference while maintaining model performance, and self-distillation is used to mitigate the limited annotation of speech data. Together, this enables fast and effective speech emotion recognition on resource-constrained edge devices.
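The joint training of a deep model and its shallower versions can be illustrated with a small sketch. It follows a generic self-distillation formulation (hard-label cross-entropy plus a temperature-softened KL term towards the deepest exit) and is not taken from the paper; the function names, `alpha`, `temperature`, and the per-exit logit layout are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one example against its ground-truth class index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(exit_logits, label, alpha=0.5, temperature=2.0):
    """Hypothetical combined loss for one example.

    exit_logits: one logit vector per exit, shallowest first, deepest last.
    The deepest exit acts as the teacher; in a real framework its output
    would be detached so that no gradient flows through the teacher term.
    """
    teacher = softmax(exit_logits[-1], temperature)
    # The deepest exit is fine-tuned on the hard label only.
    loss = cross_entropy(exit_logits[-1], label)
    for student_logits in exit_logits[:-1]:
        student = softmax(student_logits, temperature)
        hard = cross_entropy(student_logits, label)  # ground-truth term
        soft = kl_divergence(teacher, student)       # imitate the teacher
        loss += (1.0 - alpha) * hard + alpha * soft
    return loss


def predict_with_budget(exit_logits, max_exit):
    """Return the class predicted by the exit at index max_exit.

    On an edge device, a smaller max_exit would mean running fewer layers.
    """
    logits = exit_logits[max_exit]
    return max(range(len(logits)), key=logits.__getitem__)
```

When every exit produces the same logits, the KL terms vanish and only the cross-entropy terms remain; a shallow exit that disagrees with the deepest one is penalised further. At inference time, `predict_with_budget` mirrors the accuracy-efficiency trade-off: a device with a tight budget stops at a shallow exit, while a less constrained one uses the deepest.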

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition under Resource Constraints with Data Distillation

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Hierarchical Network with Decoupled Knowledge Distillation for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Based on Meta-Transfer Learning with Domain Adaption

Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

Speech Emotion Recognition Using Self-Supervised Features

Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition

Improvement on Speech Emotion Recognition Based on Deep Convolutional Neural Networks

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition

Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models

Enhancing speech emotion recognition through deep learning and handcrafted feature fusion