Real-time End-to-End Monaural Multi-speaker Speech Recognition

Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong
DOI: https://doi.org/10.21437/Interspeech.2021-1449
2021-01-01
Abstract: The rising interest in single-channel multi-speaker speech separation has triggered the development of end-to-end multi-speaker automatic speech recognition (ASR). However, most systems to date decode autoregressively, which makes decoding slow and hinders the use of multi-speaker speech recognition in real-world environments. In this paper, we first comprehensively investigate and compare the mainstream end-to-end multi-speaker speech recognition systems. Second, we improve the recently proposed non-autoregressive end-to-end speech recognition model Mask-CTC and introduce it to multi-speaker speech recognition to achieve real-time decoding. Our experiments on the LibriMix dataset show that, with the same number of parameters, the non-autoregressive model achieves performance close to that of the autoregressive model while decoding faster.