Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions

Yang Liu, Haoqin Sun, Geng Chen, Qingyue Wang, Zhen Zhao, Xugang Lu, Longbiao Wang
2023-12-21
Abstract: Speech emotion recognition (SER) performance deteriorates significantly in the presence of noise, making it challenging to achieve competitive performance in noisy conditions. To this end, we propose a multi-level knowledge distillation (MLKD) method, which aims to transfer the knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically, we use clean speech features extracted by the wav2vec-2.0 as the learning goal and train the distil wav2vec-2.0 to approximate the feature extraction ability of the original wav2vec-2.0 under noisy conditions. Furthermore, we leverage the multi-level knowledge of the original wav2vec-2.0 to supervise the single-level output of the distil wav2vec-2.0. We evaluate the effectiveness of our proposed method by conducting extensive experiments using five types of noise-contaminated speech on the IEMOCAP dataset, which show promising results compared to state-of-the-art models.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the degradation of Speech Emotion Recognition (SER) performance in noisy environments. In practical applications, noise from various sources significantly reduces SER performance, which motivates a more robust system.

The proposed method, Multi-Level Knowledge Distillation (MLKD), transfers knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. The teacher is wav2vec-2.0, used as a feature extractor for its ability to derive comprehensive emotional feature representations from raw audio waveforms. The student is a lightweight distil wav2vec-2.0 model, and the multi-level knowledge of the teacher guides the student's feature extraction.

The experiments use the IEMOCAP dataset with five types of added noise (babble, F-16 fighter jet, factory, HF radio, Volvo 340) to simulate different noisy environments. Under all noise types, MLKD improves Unweighted Accuracy (UA) by an average of 18.23% over the baseline methods. The method also significantly reduces the number of model parameters and the inference time, making the student model more lightweight and better suited to practical applications.

In summary, the paper's main contribution is a new multi-level knowledge distillation framework that improves speech emotion recognition in noisy environments while keeping computational complexity low.
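To make the idea of multi-level supervision concrete, the sketch below compares one student output (computed from noisy speech) against several teacher layer outputs (computed from the matching clean speech) and combines the per-level errors into a single loss. This is only an illustration, not the paper's implementation: the summary does not give the exact loss, so mean squared error and uniform level weights are assumptions here, and the names `mse` and `multi_level_distillation_loss` are hypothetical. Features are plain Python lists (one vector per frame) so the example runs without any deep learning library.

```python
from typing import Optional, Sequence

Vector = Sequence[float]
Features = Sequence[Vector]  # one feature vector per audio frame


def mse(a: Features, b: Features) -> float:
    """Mean squared error between two equally shaped feature sequences."""
    if len(a) != len(b):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("feature dimensions differ")
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty features")
    return total / count


def multi_level_distillation_loss(
    student_out: Features,
    teacher_levels: Sequence[Features],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of errors between the single student output and
    each teacher level. Uniform weights are an assumption of this sketch."""
    if not teacher_levels:
        raise ValueError("need at least one teacher level")
    if weights is None:
        weights = [1.0 / len(teacher_levels)] * len(teacher_levels)
    if len(weights) != len(teacher_levels):
        raise ValueError("one weight per teacher level is required")
    return sum(w * mse(student_out, level)
               for w, level in zip(weights, teacher_levels))


# Toy example: 2 frames, 2 feature dimensions, 2 teacher levels.
student = [[1.0, 2.0], [3.0, 4.0]]        # student output on noisy speech
teacher_levels = [
    [[1.0, 2.0], [3.0, 4.0]],             # level 1: identical, error 0.0
    [[2.0, 2.0], [3.0, 6.0]],             # level 2: error (1 + 4) / 4 = 1.25
]
loss = multi_level_distillation_loss(student, teacher_levels)
print(loss)  # 0.625
```

In a real training setup the teacher would stay frozen, the feature lists would be tensors from the two networks, and this loss would be minimized with respect to the student's parameters, typically alongside an emotion classification loss.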