Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection

Minhua Wu, Sankaran Panchapagesan, Ming Sun, Jiacheng Gu, Ryan Thomas, Shiv Naga Prasad Vitaladevuni, Bjorn Hoffmeister, Arindam Mandal
DOI: https://doi.org/10.1109/icassp.2018.8462227
2018-01-01
Abstract: Accurate on-device wake word detection is crucial to products with far-field voice control, such as the Amazon Echo. Building a wake word system with both a low False Reject Rate (FRR) and a low False Alarm Rate (FAR) is challenging in real scenarios with various types of background speech, music, or noise, especially when computational resources on the device are limited. In this paper, we introduce a two-stage wake word system based on Deep Neural Network (DNN) acoustic modeling, propose a new way to model non-keyword background events using monophone-based units, and present how richer information can be extracted from those monophone units for final wake word detection. With the new system, we obtain around a 16% relative reduction in FRR at a fixed false alarm level, and about a 37% relative reduction in FAR at a fixed miss rate. The 2nd-stage classifier alone reduces the false alarm rate by about 67% relative to the 1st-stage hypotheses, while using very few computational resources.
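The figures in the abstract are relative reductions, not absolute percentage-point drops. As a minimal sketch of that arithmetic, the snippet below applies the quoted 16%, 37%, and 67% reductions to baseline rates. The baseline values and the helper name `apply_relative_reduction` are assumptions made up for illustration; the abstract does not report absolute FRR or FAR values.

```python
def apply_relative_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Return the rate that remains after a relative reduction of the baseline."""
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be in [0, 1]")
    return baseline_rate * (1.0 - relative_reduction)


# Hypothetical baselines, chosen only for illustration (not from the paper).
baseline_frr = 0.10        # assumed baseline False Reject Rate
baseline_far = 0.020       # assumed baseline False Alarm Rate
first_stage_far = 0.030    # assumed FAR of 1st-stage hypotheses alone

# Relative reductions quoted in the abstract.
new_frr = apply_relative_reduction(baseline_frr, 0.16)           # FAR held fixed
new_far = apply_relative_reduction(baseline_far, 0.37)           # miss rate held fixed
second_stage_far = apply_relative_reduction(first_stage_far, 0.67)

print(f"FRR: {baseline_frr:.4f} -> {new_frr:.4f}")
print(f"FAR: {baseline_far:.4f} -> {new_far:.4f}")
print(f"FAR after 2nd stage: {first_stage_far:.4f} -> {second_stage_far:.4f}")
```

Under these assumed baselines, a 16% relative reduction leaves 84% of the original FRR, so the absolute change depends entirely on the starting rate.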