Web-based keyword adapted Language Modeling for Keyword Spotting

Wenzhu Shen, Ji Wu, Wei Li
DOI: https://doi.org/10.1109/ISCSLP.2010.5684898
2010-01-01
Abstract: The language model (LM) is one of the key components in keyword spotting (KWS). The rapid development of the World Wide Web (WWW) makes it an extremely large and valuable data source for LM training, but using raw web transcripts directly is suboptimal because the content of the web corpus does not match the test data. This paper proposes a novel two-step data selection method for KWS language modeling, based on the predefined keyword list. First, we exploit the keywords to be spotted by submitting each keyword as an independent search engine query; the retrieved web corpus could be used directly to train a web LM, although we do not do so. Second, we select the sentences that contain the predefined keywords from the raw web corpus. The resulting keyword-specific corpus is used to train an adaptive LM, which in turn adapts a general-purpose LM. Our keyword-specific LM allows the KWS task to be topic-independent, so the keywords can be arbitrary and unrelated to one another. Our experimental results show that the keyword-specific LM outperforms the one trained on the raw web corpus. Expanding the size of the web-based corpus no longer improves the equal error rate (EER) point of the KWS system, but it does improve performance at both ends of the DET (Detection Error Tradeoff) curve.
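The two-step selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the caller-supplied `search` callable (a stand-in for a real search engine query), the naive punctuation-based sentence splitter, and whole-word matching on whitespace-delimited text are all assumptions made for illustration.

```python
import re


def build_raw_corpus(keywords, search):
    """Step 1 (sketch): query each keyword independently and pool the results.

    `search` is a hypothetical caller-supplied function that maps one
    keyword to a list of retrieved page texts.
    """
    raw_corpus = []
    for keyword in keywords:
        raw_corpus.extend(search(keyword))
    return raw_corpus


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_keyword_sentences(raw_corpus, keywords):
    """Step 2 (sketch): keep only sentences containing at least one keyword."""
    # One case-insensitive pattern that matches any keyword as a whole word.
    alternatives = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    selected = []
    for document in raw_corpus:
        for sentence in split_sentences(document):
            if pattern.search(sentence):
                selected.append(sentence)
    return selected
```

The selected sentences would then serve as the keyword-specific training corpus for the adaptive LM. The abstract does not specify how that LM is combined with the general-purpose one, so the sketch stops at data selection.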