Knowledge Distillation for End-to-End Monaural Multi-talker ASR System

Wangyou Zhang, Xuankai Chang, Yanmin Qian
DOI: https://doi.org/10.21437/interspeech.2019-3192
2019-01-01
Abstract: End-to-end models for monaural multi-speaker automatic speech recognition (ASR) have become an important approach for handling multi-talker mixed speech in the cocktail party scenario. However, there is still a large performance gap between multi-speaker and single-speaker speech recognition systems. In this paper, we propose a novel framework that integrates teacher-student training with the attention-based end-to-end ASR model, allowing knowledge to be distilled effectively from a single-talker ASR system to a multi-talker one. First, the objective function is revised to combine the knowledge from both single-talker and multi-talker labels. Then, we extend the original single attention to speaker parallel attention modules in the teacher-student training based end-to-end framework to further boost performance. Moreover, a curriculum learning strategy that orders the training data by signal-to-noise ratio (SNR) is designed to obtain a further improvement. The proposed methods are evaluated on two-speaker mixed speech generated from the WSJ0 corpus, which is commonly used for this task. The experimental results show that the proposed knowledge transfer architecture with an end-to-end model can significantly improve system performance for monaural multi-talker speech recognition, achieving more than 15% relative WER reduction over the traditional end-to-end model.
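The abstract describes two ideas that a small sketch may make more concrete: an objective that combines hard multi-talker labels with soft targets from a single-talker teacher, and a curriculum that orders training mixtures by SNR. The Python below is only an illustrative sketch under stated assumptions, not the authors' implementation. The interpolation weight `lam`, the function names, the `snr_db` field, and the choice to present mixtures with larger absolute SNR first are all hypothetical; the abstract specifies neither the exact loss form nor the ordering direction.

```python
import math


def cross_entropy(target_probs, student_probs, eps=1e-12):
    """Cross entropy H(target, student) for a single output step."""
    return -sum(t * math.log(s + eps) for t, s in zip(target_probs, student_probs))


def distillation_loss(student_probs, teacher_probs, label_index, lam=0.5):
    """Interpolate a hard-label loss with a teacher soft-label loss.

    student_probs: multi-talker student posterior over output tokens.
    teacher_probs: single-talker teacher posterior for the same step.
    label_index:   index of the reference token (hard label).
    lam:           hypothetical interpolation weight in [0, 1] (an assumption).
    """
    one_hot = [1.0 if i == label_index else 0.0 for i in range(len(student_probs))]
    hard = cross_entropy(one_hot, student_probs)        # supervised term
    soft = cross_entropy(teacher_probs, student_probs)  # distillation term
    return (1.0 - lam) * hard + lam * soft


def curriculum_order(utterances, large_snr_first=True):
    """Order mixtures by absolute SNR in dB.

    Assumption: each utterance is a dict with a hypothetical 'snr_db' key,
    and mixtures where one speaker dominates (large |SNR|) are treated as
    easier. Set large_snr_first=False to try the opposite ordering.
    """
    return sorted(utterances, key=lambda u: abs(u["snr_db"]), reverse=large_snr_first)
```

With `lam=0` the loss reduces to ordinary cross entropy on the reference labels, and with `lam=1` the student is trained purely to match the teacher's posteriors, so the weight controls how much single-talker knowledge is transferred.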