Fast Yet Effective Speech Emotion Recognition with Self-distillation

Zhao Ren, Thanh Tam Nguyen, Yi Chang, Björn W. Schuller
DOI: https://doi.org/10.48550/arXiv.2210.14636
2022-10-26
Abstract: Speech emotion recognition (SER) is the task of recognising human's emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become a trustworthy human conversational partner. Due to the lengthy nature of speech, SER also suffers from the lack of abundant labelled data for powerful models like deep neural networks. Pre-trained complex models on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue achieving a fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pretrained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) the adoption of self-distillation method upon the acoustic modality breaks through the limited ground-truth of speech data, and outperforms the existing models' performance on an SER dataset; (2) executing powerful models at different depth can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process rather than training from scratch for self-distillation leads to faster learning time and the state-of-the-art accuracy on data with small quantities of label information.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses three key challenges in Speech Emotion Recognition (SER):

1. **Scarcity of annotated data**: Annotating speech data is costly and time-consuming, so little labelled data is available for training powerful models. This limits how well complex models such as deep neural networks perform on SER tasks.
2. **Trade-off between model efficiency and accuracy**: Pre-trained large-scale models perform well on SER through transfer learning, but fine-tuning them requires a large amount of memory and their inference efficiency is low. The open question is how to make the model faster while maintaining its accuracy.
3. **Adaptability on resource-constrained devices**: Powerful models need to run at different depths on resource-constrained edge devices so that accuracy can be traded against efficiency adaptively.

To address these challenges, the authors propose a framework based on self-distillation. The method simultaneously fine-tunes a pre-trained model and trains shallower versions of that same model. According to the abstract, this mitigates the limited ground truth available for speech data, outperforms existing models on an SER dataset, and allows the model to be executed at a reduced depth, with fewer parameters and faster inference, when it is deployed on resource-constrained edge devices.
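The training objective behind this idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used self-distillation formulation in which the deepest classifier acts as the teacher and each shallower exit is trained with a mix of cross-entropy on the label and a temperature-scaled KL divergence to the teacher's softened output. The names `self_distillation_loss`, `alpha`, and `temperature` are hypothetical, and the weighting scheme is an assumption that the summary above does not specify.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between the softmax of `logits` and a hard label index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(exit_logits, label, alpha=0.5, temperature=2.0):
    """Combine the losses of all exits for one training example.

    exit_logits: list of logit vectors, shallowest exit first; the last
                 entry is the full-depth model, used here as the teacher.
    alpha:       assumed weight of the distillation term for each student.
    temperature: assumed softening temperature for the soft targets.
    """
    teacher_logits = exit_logits[-1]
    # In a real framework the teacher output would be detached from the
    # gradient graph before being used as a soft target.
    teacher_probs = softmax(teacher_logits, temperature)

    # The deepest classifier is fine-tuned on the hard label only.
    loss = cross_entropy(teacher_logits, label)

    for student_logits in exit_logits[:-1]:
        student_probs = softmax(student_logits, temperature)
        hard = cross_entropy(student_logits, label)
        # The temperature**2 factor keeps the soft-term scale comparable
        # to the hard-label term.
        soft = kl_divergence(teacher_probs, student_probs) * temperature ** 2
        loss += (1 - alpha) * hard + alpha * soft
    return loss


if __name__ == "__main__":
    # Three exits over four hypothetical emotion classes; label index 2.
    logits_per_exit = [
        [0.2, 0.1, 0.9, 0.0],   # shallowest exit
        [0.1, 0.0, 1.5, 0.2],   # middle exit
        [0.0, -0.3, 2.4, 0.1],  # full-depth model (teacher)
    ]
    print(self_distillation_loss(logits_per_exit, label=2))
```

Because every exit has its own classifier, inference can stop at whichever depth fits the device's budget, which is how a single trained model could offer several accuracy-efficiency operating points.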