Focal Loss And Double-Edge-Triggered Detector For Robust Small-Footprint Keyword Spotting

Bin Liu, Shuai Nie, Yaping Zhang, Shan Liang, Zhanlei Yang, Wenju Liu
DOI: https://doi.org/10.1109/icassp.2019.8682534
2019-01-01
Abstract: A keyword spotting (KWS) system is a critical component of human-computer interfaces: it detects a specific keyword in a continuous stream of audio. The goal of KWS is to provide high detection accuracy at a low false alarm rate with small memory and computation requirements. A DNN-based KWS system faces a large class imbalance during training because far less data is usually available for the keyword than for background speech; the background data overwhelms training and leads to a degenerate model. In this paper, we explore focal loss for training a small-footprint KWS system. It automatically down-weights the contribution of easy samples during training and focuses the model on hard samples, which naturally solves the class imbalance and allows us to use all available data efficiently. Furthermore, many keywords of Chinese conversational assistants are repeated words due to idiomatic usage, such as 'XIAO DU XIAO DU'. We propose a double-edge-triggered detection method for repeated keywords, which significantly reduces the false alarm rate relative to the single-threshold method. Systematic experiments demonstrate significant improvements over the baseline system.
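The down-weighting described in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss formula, so the code below assumes the commonly used form of focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The function names, the example probabilities, and gamma = 2.0 are illustrative assumptions, not values taken from the paper.

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one sample: -log(p_t)."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -math.log(p_t)

def focal_loss(probs, target, gamma=2.0):
    """Assumed standard focal loss: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an illustrative default, not a value from the paper.
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical two-class posteriors: index 0 = background, index 1 = keyword.
easy_background = [0.95, 0.05]  # true class 0, confidently correct (easy sample)
hard_keyword = [0.70, 0.30]     # true class 1, poorly classified (hard sample)

# The focal-to-cross-entropy ratio equals (1 - p_t)^gamma, so the easy
# sample is scaled down far more strongly than the hard one.
easy_ratio = focal_loss(easy_background, 0) / cross_entropy(easy_background, 0)
hard_ratio = focal_loss(hard_keyword, 1) / cross_entropy(hard_keyword, 1)
print(f"easy sample scale: {easy_ratio:.4f}")
print(f"hard sample scale: {hard_ratio:.4f}")
```

Under these assumptions, the many confidently classified background frames contribute very little to the total loss, while misclassified keyword frames keep most of their weight, which is the behavior the abstract relies on to handle class imbalance.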