Multistate Encoding with End-To-End Speech RNN Transducer Network

Zelin Wu, Bo Li, Yu Zhang, Petar S. Aleksic, Tara N. Sainath
DOI: https://doi.org/10.1109/ICASSP40776.2020.9054287
2020-05-01
Abstract: Recurrent Neural Network Transducer (RNN-T) models [1] provide high-accuracy automatic speech recognition (ASR). Such end-to-end (E2E) models combine the acoustic, pronunciation, and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique reduces Word Error Rate (WER) by up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
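To make the idea of feeding a contextual signal into the network concrete, here is a minimal sketch of one possible encoding: a one-hot state vector appended to every acoustic frame before it reaches the encoder. The abstract does not say which encodings the paper evaluates, so the one-hot choice, the state names in STATES, and the helper names append_context and relative_wer_reduction are all assumptions made for illustration. The last helper only shows how a relative (as opposed to absolute) WER reduction is computed.

```python
from typing import List, Sequence

# Hypothetical state inventory; the abstract does not list the states used.
STATES = ["idle", "media_playing", "alarm_ringing", "dialog_confirm"]


def one_hot(state: str, states: Sequence[str] = STATES) -> List[float]:
    """Return a one-hot vector for `state` (an assumed encoding, not the paper's)."""
    vec = [0.0] * len(states)
    vec[states.index(state)] = 1.0  # raises ValueError for an unknown state
    return vec


def append_context(frames: Sequence[Sequence[float]], state: str) -> List[List[float]]:
    """Concatenate the same state vector onto every acoustic feature frame."""
    ctx = one_hot(state)
    return [list(frame) + ctx for frame in frames]


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Two made-up 2-dimensional frames, tagged with a made-up device state.
frames = [[0.1, 0.2], [0.3, 0.4]]
augmented = append_context(frames, "media_playing")
```

Each augmented frame is the original feature vector followed by len(STATES) extra dimensions, so the encoder input size grows by the number of states. A learned embedding of the state would be another plausible option.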
Computer Science