Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction.

Ziyang Zhuang, Kun Zou, Chenfeng Miao, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447049
2024-01-01
Abstract: In automatic speech recognition (ASR), the output sequence should correspond to a linear transcription of the input sequence. Much work has been done on learning monotonic alignment in end-to-end (E2E) ASR models, but these methods mainly target streaming and usually degrade ASR performance. In contrast, some studies have shown that monotonic alignment benefits non-streaming attention-based models. Motivated by this, we propose the enhanced Gaussian Monotonic Alignment (e-GMA), which reduces the difficulty of learning monotonic alignment; the reconstructed attention matrix leads to improved accuracy in ASR tasks. Experiments on the LibriSpeech dataset demonstrate the effectiveness of the proposed approach. Compared with a strong baseline obtained from WeNet, the proposed model yields a 12.2% relative WER reduction on the test-clean benchmark and 9.9% on test-other.