Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition

Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
DOI: https://doi.org/10.21437/INTERSPEECH.2022-11259
2022-01-01
Abstract: Inspired by EfficientTTS [1], a recently proposed speech synthesis model, we propose a new way to train attention-based end-to-end speech recognition models with an additional training objective that allows the models to learn monotonic alignments effectively and efficiently. The introduced training objective is differentiable, computationally cheap and, most importantly, places no constraints on the network structure. It can therefore be incorporated conveniently into many speech recognition models. Through extensive experiments on the CTC/Attention architecture with Conformer blocks, we observed that our models significantly outperform the baseline models. Specifically, our best-performing model achieves a WER (Word Error Rate) of 3.18% on the LibriSpeech test-clean benchmark and 8.41% on test-other. Compared with a strong baseline obtained with WeNet, the proposed model achieves a 7.6% relative WER reduction on test-clean and 6.9% on test-other.
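The relative WER reductions quoted above follow the usual definition, (baseline - new) / baseline. The short Python sketch below illustrates that arithmetic only. The function names are hypothetical, and the baseline values it prints are back-computed from the figures in the abstract rather than taken from the paper, so they may differ slightly from the reported WeNet baseline because of rounding.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_reduction: float) -> float:
    """Invert the formula above: baseline = new / (1 - reduction)."""
    return new_wer / (1.0 - rel_reduction)


if __name__ == "__main__":
    # WER (%) and relative reductions as stated in the abstract.
    reported = {
        "test-clean": (3.18, 0.076),
        "test-other": (8.41, 0.069),
    }
    for split, (wer, rel) in reported.items():
        # Derived value (an assumption, not a figure reported in the abstract).
        base = implied_baseline(wer, rel)
        print(f"{split}: WER {wer:.2f}%, implied baseline about {base:.2f}%")
```

Note that these are relative reductions: the absolute gain on test-clean is a fraction of a percentage point, not 7.6 points.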