Dynamic Speech Endpoint Detection with Regression Targets

Dawei Liang, Hang Su, Tarun Singh, Jay Mahadeokar, Shanil Puri, Jiedan Zhu, Edison Thomaz, Mike Seltzer
DOI: https://doi.org/10.1109/icassp49357.2023.10096595
2022-01-01
Abstract: Interactive voice assistants are widely used as input interfaces in various scenarios, e.g., on smart home devices, wearables, and AR devices. Detecting the end of a speech query, i.e., speech end-pointing, is an important task for voice assistants to interact with users. Traditionally, speech end-pointing is based on pure classification methods along with arbitrary binary targets. In this paper, we propose a novel regression-based speech end-pointing model, which enables an end-pointer to adjust its detection behavior based on the context of user queries. Specifically, we present a pause modeling method and show its effectiveness for dynamic end-pointing. Based on our experiments with vendor-collected smartphone and wearable speech queries, our strategy shows a better trade-off between end-pointing latency and accuracy than the traditional classification-based method. We further discuss the benefits of this model and the generalization of the framework in the paper.