End-to-End Contextual Speech Recognition with Word-Piece-Level Token Selection.

Zhibin Wu, Yang Zou, Jian Zhou, Min Wang, Xiaoqin Zeng
DOI: https://doi.org/10.18293/dmsviva2023-003
2023-01-01
Abstract: The use of dynamic contextual information in end-to-end automatic speech recognition (ASR) has been an active research topic. The popular Contextual LAS (CLAS) provides a favorable all-neural solution. Nevertheless, it cannot be extended to large bias lists without many recognition errors caused by similar pronunciation or word-fragment repetition. To address this limitation, this paper proposes a model called Fine-CLAS, built on CLAS, which exploits word-piece-level contextual knowledge and fuses it with the original phrase-level contextual knowledge so that the contextual bias module can focus on fine-grained contextual information. First, a prefix tree constraint is presented to reduce the number of contextual phrases. Then, a strategy for word-piece-level token selection is designed to obtain a new word-piece-level embedding vector. Finally, a contextual transformation chain is constructed between the word-piece-level embedding vector key-value pairs to obtain new key-value pairs. With these techniques, the proposed model reduces the word error rate (WER) by 5.37% and 2.10% and improves the F1-score by 1.10% and 2.10% on the LibriSpeech test-clean and test-other sets, respectively, demonstrating better ASR and contextual bias performance.
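
The abstract does not say how the prefix tree constraint is implemented. As a rough illustration only, the sketch below assumes the bias phrases are already split into word pieces and that the constraint keeps only the phrases whose word pieces match what has been decoded so far. The names TrieNode, build_trie, and active_phrases are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


class TrieNode:
    """One node of a prefix tree over word pieces (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Indices of all bias phrases whose word pieces pass through this node.
        self.phrase_ids: List[int] = []


def build_trie(phrases: Sequence[Sequence[str]]) -> TrieNode:
    """Build a prefix tree from bias phrases given as word-piece sequences."""
    root = TrieNode()
    for idx, pieces in enumerate(phrases):
        node = root
        node.phrase_ids.append(idx)  # every phrase is active at the root
        for piece in pieces:
            node = node.children.setdefault(piece, TrieNode())
            node.phrase_ids.append(idx)
    return root


def active_phrases(root: TrieNode, prefix: Sequence[str]) -> List[int]:
    """Return indices of phrases that start with the decoded word-piece prefix.

    Assumption: phrases that do not match the prefix are dropped from the
    bias list for the current decoding step.
    """
    node = root
    for piece in prefix:
        child = node.children.get(piece)
        if child is None:
            return []  # no bias phrase starts with this prefix
        node = child
    return list(node.phrase_ids)


if __name__ == "__main__":
    # Toy bias list; the word-piece splits are made up for illustration.
    bias_list = [["_new", "_york"], ["_new", "ton"], ["_bos", "ton"]]
    trie = build_trie(bias_list)
    print(active_phrases(trie, []))
    print(active_phrases(trie, ["_new"]))
    print(active_phrases(trie, ["_new", "ton"]))
```

With a large bias list, a lookup of this kind shrinks the set of candidate phrases as decoding proceeds, which is one plausible reading of "reduce the number of contextual phrases"; the actual Fine-CLAS mechanism may differ.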