End-to-end Contextual Speech Recognition Using Class Language Models and a Token Passing Decoder.

Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen
DOI: https://doi.org/10.1109/icassp.2019.8683573
2018-01-01
Abstract: End-to-end (E2E) modeling of automatic speech recognition (ASR) blends all the components of a traditional speech recognition system into a single, unified model. Although this simplifies ASR systems, the unified model is hard to adapt when the training and test data are mismatched. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because contextual information is only available at inference time. To improve performance in the presence of contextual information during training, we propose to use class-based language models (CLMs) that can populate context-dependent information during inference. To enable this approach to scale to a large number of class members and to minimize search errors, we propose a token passing algorithm with efficient token recombination for E2E systems. We evaluate the proposed system on general and contextual ASR tasks and achieve a 62% relative Word Error Rate (WER) reduction on the contextual ASR task without hurting recognition performance on the general ASR task. We also show that the proposed method performs well without modification of the decoding hyper-parameters across tasks, making it a desirable solution for E2E ASR.
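For readers unfamiliar with the metric, the reported figure can be read using the usual definition of relative WER reduction. The symbols below are illustrative and do not appear in the abstract: $\mathrm{WER}_{\text{base}}$ is assumed to denote the baseline system's WER on the contextual task and $\mathrm{WER}_{\text{prop}}$ the proposed system's WER on the same task.

```latex
\[
\text{relative WER reduction}
  = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{prop}}}{\mathrm{WER}_{\text{base}}}
\]
% A 62% relative reduction therefore corresponds to
\[
\mathrm{WER}_{\text{prop}} = (1 - 0.62)\,\mathrm{WER}_{\text{base}} = 0.38\,\mathrm{WER}_{\text{base}}
\]
```

Under this reading, the 62% figure is a ratio of error rates and not an absolute drop in percentage points; the abstract does not state the absolute WER values.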