Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
2023-11-01
Abstract: As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the following problem: as pre-trained speech recognition models such as Whisper grow larger, running them in low-latency or resource-constrained environments becomes challenging. The authors use knowledge distillation (KD) to compress Whisper into a smaller, faster model called Distil-Whisper, while preserving its robustness across audio domains and noisy acoustic conditions and reducing hallucination errors in long-form transcription. The main contributions are:

1. **Large-scale pseudo-labelled dataset**: A large-scale open-source dataset is assembled through pseudo-labelling and used to train Distil-Whisper.
2. **High-quality pseudo-label selection**: A simple word error rate (WER) heuristic selects only the highest-quality pseudo-labels for training.
3. **Model acceleration and parameter reduction**: The distilled model is 5.8 times faster than the original Whisper model and has 51% fewer parameters, while performing to within 1% WER of it on out-of-distribution test data in a zero-shot transfer setting.
4. **Robustness and reduced hallucination**: Distil-Whisper maintains Whisper's robustness to difficult acoustic conditions and is less prone to hallucination errors on long-form audio.
5. **Speculative decoding**: Distil-Whisper is designed to be paired with Whisper for speculative decoding, which yields a 2 times speed-up while mathematically ensuring the same outputs as the original model.

In summary, the paper's main goal is to compress Whisper through knowledge distillation so that it runs efficiently in resource-constrained environments while retaining its performance across varied audio conditions.
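The WER-based selection of pseudo-labels can be illustrated with a minimal sketch. The summary does not say what the pseudo-labels are compared against or which threshold is used, so the following are assumptions made purely for illustration: each example carries an existing reference transcript (`ground_truth`) alongside the model's `pseudo_label`, the cut-off `max_wer=0.10` is a placeholder value, and `normalise` is a crude stand-in for whatever text normalisation the authors apply. The function names are hypothetical and not taken from the released code.

```python
import re


def normalise(text):
    """Lowercase, replace punctuation with spaces, and split into words.

    A crude stand-in normaliser (assumption), so that casing and
    punctuation differences are not counted as errors.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        # Convention chosen here: empty vs. empty is a perfect match.
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(examples, max_wer=0.10):
    """Keep examples whose pseudo-label is within max_wer of the reference.

    max_wer is an illustrative placeholder, not a value from the summary.
    """
    return [
        ex for ex in examples
        if word_error_rate(ex["ground_truth"], ex["pseudo_label"]) <= max_wer
    ]


examples = [
    {"ground_truth": "The cat sat on the mat.", "pseudo_label": "the cat sat on the mat"},
    {"ground_truth": "the cat sat on the mat", "pseudo_label": "the cat sat on a hat"},
]
kept = filter_pseudo_labels(examples)
print(len(kept))
```

In this toy input the first pair differs only in casing and punctuation and is kept, while the second has two substituted words out of six and is discarded. A production pipeline would more likely rely on an established WER implementation and normaliser rather than the hand-rolled versions shown here.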