Abstract: Transducer is one of the mainstream frameworks for streaming speech recognition. There is a performance gap between the streaming and non-streaming transducer models due to limited context. To reduce this gap, an effective way is to ensure that their hidden and output distributions are consistent, which can be achieved by hierarchical knowledge distillation. However, it is difficult to ensure the distribution consistency simultaneously because the learning of the output distribution depends on the hidden one. In this paper, we propose an adaptive two-stage knowledge distillation method consisting of hidden layer learning and output layer learning. In the former stage, we learn hidden representation with full context by applying a mean square error loss function. In the latter stage, we design a power transformation based adaptive smoothness method to learn a stable output distribution. It achieves a 19% relative reduction in word error rate and a faster response for the first token compared with the original streaming model on the LibriSpeech corpus.

What problem does this paper attempt to address?

The problem this paper addresses is **narrowing the performance gap between streaming and non-streaming transducer-based automatic speech recognition (ASR) models**.

### Problem Background

Streaming models matter in practice because they respond quickly, for example in real-time speech recognition. However, because they can use only limited context, their accuracy is usually significantly lower than that of non-streaming models. The gap shows up mainly as differences in the distributions of the hidden layers and the output layers.

### Solution

To narrow this gap, the paper proposes an **adaptive two-stage knowledge distillation method**. Specifically:

1. **First stage: Hidden layer learning**
   - The streaming model learns hidden representations through a mean squared error (MSE) loss, so that it can acquire full-context information from the non-streaming model.
   - The formula is as follows:
     \[ L_{\text{hidden}}=\sum_{i=1}^{N}\text{MSE}(E_S^i, E_T^i)+\sum_{j=1}^{M}\text{MSE}(D_S^j, D_T^j) \]
     where \(E_S^i\) and \(E_T^i\) are the outputs of the \(i\)-th encoder layer of the streaming and non-streaming models respectively, and \(D_S^j\) and \(D_T^j\) are the outputs of the \(j\)-th decoder layer of the streaming and non-streaming models respectively.
2. **Second stage: Output layer learning**
   - The paper designs an adaptive smoothing method based on power transformation to learn a stable output distribution.
   - Specifically, the Kullback-Leibler (KL) divergence between the two output distributions is minimized, and a temperature coefficient is introduced to control the smoothness of those distributions.
   - The formula is as follows:
     \[ L_{\text{output}}=L_{\text{rnn-t}}(Q_S, y_{\text{true}})+\text{KL}\left(\frac{Q_S}{\tau},\frac{Q_T}{\tau}\right) \]
     where \(Q_S\) and \(Q_T\) are the output distributions of the streaming and non-streaming models respectively, and \(\tau\) is the temperature parameter.
3. **Two-stage KD loss function**
   - The two tasks above are combined into the two-stage KD loss:
     \[ L_{\text{KD}}=\alpha\times L_{\text{hidden}}+\beta\times L_{\text{output}} \]
     where \(\alpha\) and \(\beta\) are hyperparameters that balance the learning of the two tasks.

### Experimental Results

Experiments on the LibriSpeech corpus show the following:

- The word error rate (WER) is reduced by 19.24% relative to the original streaming model.
- The first token is emitted faster than with the original streaming model.

In conclusion, the proposed adaptive two-stage knowledge distillation method narrows the performance gap between streaming and non-streaming ASR models and improves both the accuracy and the response speed of the streaming model.
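The three loss terms above can be made concrete with a small numeric sketch. It is an illustration of the summarized formulas only, not the paper's implementation, and it rests on several assumptions: layer outputs are flat lists of floats, the temperature \(\tau\) is fixed and applied to logits before the softmax, the KL term is taken from the non-streaming (teacher) distribution to the streaming (student) one, and the RNN-T loss is supplied as a precomputed number. The paper's power-transformation-based adaptive smoothing is not reproduced here. All function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hidden_loss(student_layers, teacher_layers):
    """Sum of per-layer MSE terms (one half of L_hidden)."""
    return sum(mse(s, t) for s, t in zip(student_layers, teacher_layers))


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau; a larger tau yields a smoother distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def two_stage_kd_loss(student_enc, teacher_enc, student_dec, teacher_dec,
                      student_logits, teacher_logits, rnnt_loss,
                      tau=2.0, alpha=1.0, beta=1.0):
    """Combine the hidden and output terms as alpha * L_hidden + beta * L_output."""
    # L_hidden: encoder-layer MSE plus decoder-layer MSE.
    l_hidden = (hidden_loss(student_enc, teacher_enc)
                + hidden_loss(student_dec, teacher_dec))
    # L_output: RNN-T loss plus KL between temperature-smoothed distributions.
    # Assumption: teacher-to-student direction, fixed tau.
    q_student = softmax_with_temperature(student_logits, tau)
    q_teacher = softmax_with_temperature(teacher_logits, tau)
    l_output = rnnt_loss + kl_divergence(q_teacher, q_student)
    return alpha * l_hidden + beta * l_output


if __name__ == "__main__":
    loss = two_stage_kd_loss(
        student_enc=[[1.0, 2.0]], teacher_enc=[[1.0, 4.0]],
        student_dec=[[0.5, 0.5]], teacher_dec=[[0.5, 0.5]],
        student_logits=[2.0, 1.0, 0.0], teacher_logits=[2.5, 0.5, 0.0],
        rnnt_loss=1.5, tau=2.0, alpha=0.5, beta=2.0,
    )
    print(f"two-stage KD loss: {loss:.4f}")
```

In a real training setup the hidden and output terms would be computed on framework tensors, and the abstract suggests the two stages are learned one after the other rather than through a single fixed-weight sum, so the weighting above should be read as a notational aid only.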

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

Efficient Knowledge Distillation for RNN-Transducer Models

Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models

Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model

Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation

Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

Mutual-learning Sequence-Level Knowledge Distillation for Automatic Speech Recognition

Adaptive Knowledge Distillation between Text and Speech Pre-trained Models

Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation

Speech Enhancement Based on Multi-Task Adaptive Knowledge Distillation

A lightweight speech recognition method with target-swap knowledge distillation for Mandarin air traffic control communications

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Factorized and progressive knowledge distillation for CTC-based ASR models

Decouple Non-parametric Knowledge Distillation For End-to-end Speech Translation

Partial Rewriting for Multi-Stage ASR

Knowledge Distillation from Multilingual and Monolingual Teachers for End-to-End Multilingual Speech Recognition

Incremental Learning for End-to-End Automatic Speech Recognition

Two-Step Knowledge Distillation for Tiny Speech Enhancement