Keyword Guided Target Speech Recognition

Ying Shi, Lantian Li, Dong Wang, Jiqing Han
DOI: https://doi.org/10.1109/lsp.2024.3432324
2024-01-01
IEEE Signal Processing Letters
Abstract: This letter presents a new target speech recognition problem, where the target speech is defined by a keyword. For instance, when a person speaks “Hey Google” or “Help Me”, we hope the model can recognize the entire contextual speech of that person, even with strong interference speech from other people. The new problem is denoted by target content ASR (TC-ASR). The core challenge of TC-ASR is that the model needs to simultaneously detect the existence of the keyword from heavily mixed speech and recognize the target speech component using the information of the detected keyword segment. Surprisingly, our experiments show that an attention encoder-decoder (AED) model augmented with a keyword encoder can solve this problem pretty well. We also defined a key content spotting (KCS) task and tested the proposed model on it. Our experiments on the LibriMix dataset demonstrated that our approach could address the KCS task with promising accuracy, outperforming two baseline models by a large margin. Further analysis shows that the proposed model identifies the target speech by a timbre cue, i.e., ensuring that the identified speech is coherent in speaker trait.