Abstract: In recent years, a great deal of research has gone into end-to-end speech recognition models, which simplify the traditional pipeline and achieve promising results. Despite these performance gains, end-to-end models typically incur a high computational cost to perform well. To reduce this burden, knowledge distillation (KD), a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student). Previous KD approaches have commonly designed the student architecture by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme can limit the flexibility of model selection, since the student's structure must be similar to that of the given teacher. To address this limitation, we propose a new KD method for end-to-end speech recognition, namely TutorNet, that can transfer knowledge across different types of neural networks at the hidden representation level as well as the output level. Concretely, we first apply representation-level knowledge distillation (RKD) during the initialization step, and then apply softmax-level knowledge distillation (SKD) combined with the original task learning. When the student is trained with RKD, we use frame weighting, which highlights the frames to which the teacher model pays more attention. Experiments on the LibriSpeech dataset verify that the proposed method not only distills knowledge between networks with different topologies but also significantly improves the word error rate (WER) of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.
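The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact loss definitions, so everything here is an assumption for illustration: RKD is modeled as a frame-weighted mean squared error between hidden representations (with the student features assumed to be already projected to the teacher's dimensionality), SKD as a per-frame KL divergence between softened output distributions, and the frame weights are taken as a given input rather than derived from the teacher. The function names `rkd_loss`, `skd_loss`, and `stage2_loss` and the parameters `alpha` and `temperature` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one frame's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rkd_loss(student_frames, teacher_frames, frame_weights):
    """Frame-weighted mean squared error between hidden representations.

    Assumption: student features already match the teacher's dimensionality,
    and frame_weights are non-negative scores (one per frame).
    """
    weight_sum = sum(frame_weights)
    loss = 0.0
    for s, t, w in zip(student_frames, teacher_frames, frame_weights):
        mse = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
        loss += (w / weight_sum) * mse  # frames with larger weight count more
    return loss


def skd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean per-frame KL(teacher || student) over softened distributions."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p = softmax(t, temperature)  # teacher distribution
        q = softmax(s, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)


def stage2_loss(task_loss, student_logits, teacher_logits, alpha=0.5):
    """Assumed stage-2 objective: original task loss plus a weighted SKD term."""
    return task_loss + alpha * skd_loss(student_logits, teacher_logits)
```

In this sketch, stage 1 would minimize `rkd_loss` alone to initialize the student, and stage 2 would minimize `stage2_loss`, where `task_loss` stands in for whatever original training loss the student uses.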

Efficient Knowledge Distillation for RNN-Transducer Models

Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models

DCCD: Reducing Neural Network Redundancy Via Distillation

Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss

Knowledge Distillation Via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

Distil-DCCRN: A Small-footprint DCCRN Leveraging Feature-based Knowledge Distillation in Speech Enhancement

TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition

Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data

Knowledge Distillation Application Technology for Chinese NLP

Efficient Transformer Knowledge Distillation: A Performance Review

Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation

Residual Error Based Knowledge Distillation

ResKD: Residual-Guided Knowledge Distillation

Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation

An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers