Abstract: As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
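The WER-based selection of pseudo-labels mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the details of the heuristic, so several things here are assumptions: that each pseudo-label is compared against an existing transcript for the same audio, the simplified text normalisation, the field names `ground_truth` and `pseudo_label`, and the `max_wer=0.10` threshold, which is purely illustrative.

```python
import re
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split into words.

    A simplified normaliser used only for this illustration.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(examples: List[Dict], max_wer: float = 0.10) -> List[Dict]:
    """Keep examples whose pseudo-label is close to the existing transcript.

    `max_wer` is a hypothetical threshold chosen for illustration.
    """
    return [
        ex for ex in examples
        if word_error_rate(ex["ground_truth"], ex["pseudo_label"]) <= max_wer
    ]


examples = [
    {"id": 1, "ground_truth": "the cat sat on the mat",
     "pseudo_label": "The cat sat on the mat."},
    {"id": 2, "ground_truth": "the cat sat on the mat",
     "pseudo_label": "the cat sat on a hat"},
]
kept = filter_pseudo_labels(examples)
print([ex["id"] for ex in kept])
```

In this toy data the first pseudo-label differs from its transcript only in casing and punctuation, so it survives normalisation with zero errors, while the second contains two substituted words out of six and is discarded.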

What problem does this paper attempt to address?

As pre-trained speech recognition models such as Whisper grow larger, running them in low-latency or resource-constrained environments becomes challenging. The paper uses knowledge distillation (KD) to compress Whisper into a smaller, faster model called Distil-Whisper, while maintaining its robustness across audio domains and noisy acoustic conditions and reducing hallucination errors in long-form transcription. To achieve this, the authors took the following measures:

1. **Large-Scale Pseudo-Labelled Dataset**: A large-scale open-source dataset was assembled through pseudo-labelling and used to train Distil-Whisper.
2. **High-Quality Pseudo-Label Selection**: A simple word error rate (WER) heuristic was used to keep only the highest quality pseudo-labels for training.
3. **Model Acceleration and Parameter Reduction**: The distilled model is 5.8 times faster than the original Whisper model and has 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting.
4. **Robustness and Hallucination Error Reduction**: Distil-Whisper maintains Whisper's robustness to difficult acoustic conditions and is less prone to hallucination errors on long-form audio.
5. **Speculative Decoding**: Pairing Distil-Whisper with Whisper for speculative decoding yields a 2 times speed-up while mathematically ensuring the same outputs as the original model.

In summary, the main goal of the paper is to compress Whisper through knowledge distillation so that it runs efficiently in resource-constrained environments, while staying close to the original model's accuracy and preserving its robustness across audio conditions.
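The claim that speculative decoding produces the same outputs as the original model can be illustrated with a toy sketch of the greedy variant. This is not the authors' implementation: `toy_main` and `toy_draft` are hypothetical stand-ins for Whisper and Distil-Whisper that map a token prefix to the next token, and `k` (the number of drafted tokens per round) is an assumed parameter. In a real system the main model would check all drafted positions in a single forward pass, which is where the speed-up comes from; here it is called once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(main: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding with a small draft model."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The draft model cheaply proposes up to k tokens.
        n = min(k, target_len - len(tokens))
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))
        # 2. The main model verifies the proposal position by position.
        #    Its own token is always the one kept, so the result cannot
        #    differ from plain greedy decoding with the main model.
        for drafted in proposal:
            expected = main(tokens)
            tokens.append(expected)
            if expected != drafted:
                break  # discard the rest of the draft and start a new round
    return tokens


def toy_main(tokens: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic next-token rule."""
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical 'small' model: agrees with toy_main except now and then."""
    guess = toy_main(tokens)
    return (guess + 1) % 7 if len(tokens) % 5 == 0 else guess


reference = greedy_decode(toy_main, [1], 20)
speculative = speculative_decode(toy_main, toy_draft, [1], 20, k=4)
print(speculative == reference)  # True
```

Because every emitted token is the main model's own choice, a poor draft model only reduces how many drafted tokens are accepted per round; it cannot change the output.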

Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes

Efficient Compression of Multitask Multilingual Speech Models

DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Whispy: Adapting STT Whisper Models to Real-Time Environments

Transfer Learning from Whisper for Microscopic Intelligibility Prediction

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model

Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences

Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

Adaptation of Whisper models to child speech recognition

Exploring Native and Non-Native English Child Speech Recognition With Whisper

Leveraging Self-Supervised Models for Automatic Whispered Speech Recognition

Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings

Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications