Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions

Yang Liu, Haoqin Sun, Geng Chen, Qingyue Wang, Zhen Zhao, Xugang Lu, Longbiao Wang

2023-12-21

Abstract: Speech emotion recognition (SER) performance deteriorates significantly in the presence of noise, making it challenging to achieve competitive performance in noisy conditions. To this end, we propose a multi-level knowledge distillation (MLKD) method, which aims to transfer the knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically, we use clean speech features extracted by wav2vec-2.0 as the learning target and train distil wav2vec-2.0 to approximate the feature extraction ability of the original wav2vec-2.0 under noisy conditions. Furthermore, we leverage the multi-level knowledge of the original wav2vec-2.0 to supervise the single-level output of distil wav2vec-2.0. We evaluate the effectiveness of our proposed method by conducting extensive experiments using five types of noise-contaminated speech on the IEMOCAP dataset, which show promising results compared to state-of-the-art models.
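The abstract describes supervising a single student output with features from several teacher levels. The sketch below illustrates one way such a loss could be assembled; it is not the authors' implementation. The mean-squared-error distance, the per-level weights, and the names `mse` and `multi_level_distillation_loss` are assumptions made for illustration, and the abstract does not state which distance or weighting the paper uses. Features are plain nested lists (frames by dimensions) so the example runs without any deep learning library.

```python
from typing import List, Sequence

# One feature vector per time frame (hypothetical layout for this sketch).
Frames = List[List[float]]


def mse(a: Frames, b: Frames) -> float:
    """Mean squared error between two equally shaped feature sequences."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("feature sequences must have the same shape")
    count = sum(len(frame) for frame in a)
    if count == 0:
        raise ValueError("feature sequences must not be empty")
    total = sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    return total / count


def multi_level_distillation_loss(
    teacher_levels: Sequence[Frames],
    student_output: Frames,
    level_weights: Sequence[float],
) -> float:
    """Weighted sum of distances between the student's single output
    (computed from noisy speech) and each teacher level (computed from
    the clean version of the same utterance).

    ASSUMPTION: MSE and fixed weights are illustrative choices only.
    """
    if len(teacher_levels) != len(level_weights):
        raise ValueError("need exactly one weight per teacher level")
    return sum(
        weight * mse(level, student_output)
        for weight, level in zip(level_weights, teacher_levels)
    )


# Toy usage: two teacher levels, one frame with two feature dimensions.
teacher_levels = [[[1.0, 2.0]], [[3.0, 4.0]]]
student_output = [[1.0, 2.0]]
loss = multi_level_distillation_loss(teacher_levels, student_output, [0.5, 0.5])
print(loss)  # 2.0: level 1 matches exactly, level 2 has MSE 4.0
```

In a real training loop the teacher would be frozen and only the student would receive gradients from this loss; that detail is standard for knowledge distillation but is likewise not spelled out in the abstract.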

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the degradation of Speech Emotion Recognition (SER) performance in noisy environments. In practical applications, noise sources of many kinds reduce the accuracy of SER systems, which motivates a more robust approach.

The proposed method is called Multi-Level Knowledge Distillation (MLKD). Its core idea is to transfer knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. The teacher is wav2vec-2.0, used as a feature extractor for its ability to derive comprehensive emotional feature representations from raw audio waveforms. The student is a lightweight distil wav2vec-2.0 model, and multi-level knowledge from the teacher guides the student's feature extraction.

The experiments use the IEMOCAP dataset with five types of added noise (babble, F-16 fighter jet, factory, HF radio, Volvo 340) to simulate different noisy environments. Averaged over all noise types, the proposed MLKD method improves Unweighted Accuracy (UA) by 18.23% compared to the baseline methods. The method also significantly reduces the number of model parameters and the inference time, making the student model lightweight enough for practical application scenarios.

In summary, the main contribution of the paper is a new multi-level knowledge distillation framework that improves speech emotion recognition in noisy environments while keeping computational complexity low.
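The summary reports results in Unweighted Accuracy (UA), which is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The following is a minimal sketch of that metric; the function name `unweighted_accuracy` and the emotion labels in the usage example are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_accuracy(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean per-class recall over the classes that occur in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: three "angry" utterances and one "happy" utterance.
y_true = ["angry", "angry", "angry", "happy"]
y_pred = ["angry", "angry", "sad", "happy"]
ua = unweighted_accuracy(y_true, y_pred)
print(round(ua, 4))  # recall is 2/3 for "angry" and 1/1 for "happy"
```

For the same predictions, plain accuracy is 3/4 = 0.75 while UA is about 0.8333, which shows how the two metrics diverge when classes are imbalanced.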

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

Hierarchical network with decoupled knowledge distillation for speech emotion recognition

Integrated Multi-Level Knowledge Distillation for Enhanced Speaker Verification

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions

Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

Fast Yet Effective Speech Emotion Recognition with Self-distillation

Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Knowledge enhancement for speech emotion recognition via multi-level acoustic feature

Noise-Separated Adaptive Feature Distillation for Robust Speech Recognition

Distil-DCCRN: A Small-footprint DCCRN Leveraging Feature-based Knowledge Distillation in Speech Enhancement

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

LLM-Enhanced Multi-Teacher Knowledge Distillation for Modality-Incomplete Emotion Recognition in Daily Healthcare

Efficient signal-to-noise ratio enhancement model for severely contaminated distributed acoustic sensing seismic data based on heterogeneous knowledge distillation

Multi-level knowledge distillation for low-resolution object detection and facial expression recognition