Abstract: Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR models is challenging because of their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, yet non-blank frames account for only a small fraction of all frames, which leads to a severe learning imbalance. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, which negatively affect the student model's learning. We therefore factorize the distillation of non-blank and blank frames and combine the two into a progressive KD framework with three incremental stages that let the student model build up its knowledge gradually. The first stage is a simple binary classification KD task in which the student learns to distinguish non-blank frames from blank frames, since the two types of frames are learned separately in the subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions.
Compared to the baseline, our proposed method achieves a 22.5% relative CER reduction on the Aishell-1 dataset, a 23.0% relative WER reduction on the Tedlium-2 dataset, and a 17.6% relative WER reduction on the LibriSpeech dataset. To show that the method generalizes, we also evaluate it on the hybrid CTC/Attention architecture and in scenarios with cross-model topology KD.
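
The frame factorization described above can be illustrated with a small sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the blank index `BLANK = 0`, the rule that a frame is "blank" when the teacher's most probable token is the blank, the use of mean squared error for hidden representations, and the choice to collapse blank-frame posteriors to a two-way blank/non-blank distribution. It is not the paper's definition of FKL; it only shows how per-group averaging keeps the few non-blank frames from being swamped, and how a blank-frame term can ignore the irregular non-blank token probabilities.

```python
import math

BLANK = 0  # assumed index of the CTC blank token (hypothetical choice)


def split_frames(teacher_post, blank=BLANK):
    """Partition frame indices by the teacher's most probable token."""
    non_blank, blank_frames = [], []
    for t, probs in enumerate(teacher_post):
        top = max(range(len(probs)), key=probs.__getitem__)
        (blank_frames if top == blank else non_blank).append(t)
    return non_blank, blank_frames


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def group_mean(values):
    """Mean of a list, or 0.0 when the group is empty."""
    return sum(values) / len(values) if values else 0.0


def balanced_repr_loss(t_hid, s_hid, non_blank, blank_frames):
    """Assumed stage-2-style loss: MSE averaged within each group, then summed,
    so each group contributes equally regardless of how many frames it has."""
    def mse(t):
        return sum((a - b) ** 2 for a, b in zip(t_hid[t], s_hid[t])) / len(t_hid[t])
    return group_mean([mse(t) for t in non_blank]) + group_mean([mse(t) for t in blank_frames])


def factorized_posterior_loss(t_post, s_post, non_blank, blank_frames, blank=BLANK):
    """Assumed stage-3-style loss: full KL on non-blank frames; on blank frames,
    KL over the collapsed [blank, not-blank] distribution only."""
    def binary(p):
        return [p[blank], 1.0 - p[blank]]
    nb_term = group_mean([kl(t_post[t], s_post[t]) for t in non_blank])
    b_term = group_mean([kl(binary(t_post[t]), binary(s_post[t])) for t in blank_frames])
    return nb_term + b_term


# Toy teacher posteriors: 4 frames, 3 tokens (token 0 is the blank).
teacher = [[0.90, 0.05, 0.05],
           [0.10, 0.80, 0.10],
           [0.95, 0.03, 0.02],
           [0.97, 0.01, 0.02]]
non_blank, blank_frames = split_frames(teacher)

# A student that differs from the teacher only in how it spreads the
# non-blank mass on a blank frame (frame 0).
student = [row[:] for row in teacher]
student[0] = [0.90, 0.08, 0.02]
loss = factorized_posterior_loss(teacher, student, non_blank, blank_frames)
```

In this toy case the single non-blank frame carries as much weight as the three blank frames combined, and the student's different split of the non-blank mass on frame 0 is not penalized, because the blank-frame term only compares the blank probability.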