Subword scheme for keyword search

Zhipeng Chen, Teng Zhang, Ji Wu
DOI: https://doi.org/10.1109/SLT.2014.7078622
2014-01-01
Abstract: Keyword search (KWS) is an important application of spoken language technology, and Large Vocabulary Continuous Speech Recognition (LVCSR) plays an important role in KWS systems. However, for a language with a large vocabulary and a relatively small text corpus, the vocabulary size grows very quickly as the amount of text increases, as we observed in Tamil. This makes it difficult to train a reliable language model, which may undermine KWS performance. Subword units have been successfully employed in KWS systems to handle the out-of-vocabulary (OOV) problem. Inspired by this, we propose a novel subword scheme, designed from the perspective of pronunciation, to alleviate the large-vocabulary problem. We find that the subword-based system outperforms our best word-based system on Tamil conversational telephone speech. System combination experiments show that, relative to the best word-based system, a single subword-based system contributes more complementary information than the other three word-based systems combined.