Addressing the Out-of-vocabulary Problem for Large-Scale Chinese Spoken Term Detection

Sha Meng, Jian Shao, Roger Peng Yu, Jia Liu, Frank Seide
DOI: https://doi.org/10.21437/interspeech.2008-562
2008-01-01
Abstract: While the Out-Of-Vocabulary (OOV) problem remains a challenge for English spoken term detection tasks, it is underestimated for Chinese. This is because a Chinese OOV query term can still be matched as a sequence of Chinese characters, each of which is itself a word in the vocabulary. However, our experiments show that search accuracy differs significantly depending on whether or not a query is in the vocabulary: In-Vocabulary (INV) queries outperform OOV queries by more than 20%. We examine this problem with a word-lattice-based spoken term detection task. We propose a two-stage method that first locates candidates by partial phonetic matching and then refines the matching score with word lattice rescoring. Experiments show that the proposed method achieves a 24.1% relative improvement for OOV queries on a large-scale Chinese spoken term detection task.
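The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the matching or rescoring details, so everything below is an assumption made for illustration. The sketch assumes the query and the recognized audio are both available as syllable sequences, uses a sliding-window edit distance as a stand-in for partial phonetic matching, and takes a hypothetical `lattice_score` callback in place of a real word lattice. The names `locate_candidates`, `rescore`, `min_similarity`, and `weight` are all invented here.

```python
from typing import Callable, List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two syllable sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # drop a syllable from a
                           cur[j - 1] + 1,           # add a syllable from b
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def locate_candidates(query: List[str], transcript: List[str],
                      min_similarity: float = 0.6) -> List[Tuple[int, float]]:
    """Stage 1 (assumed form): keep windows that partially match the query."""
    n = len(query)
    hits = []
    for start in range(len(transcript) - n + 1):
        window = transcript[start:start + n]
        similarity = 1.0 - edit_distance(query, window) / n
        if similarity >= min_similarity:
            hits.append((start, similarity))
    return hits


def rescore(candidates: List[Tuple[int, float]],
            lattice_score: Callable[[int], float],
            weight: float = 0.5) -> List[Tuple[int, float]]:
    """Stage 2 (assumed form): blend the phonetic similarity with a lattice score."""
    rescored = [(start, weight * sim + (1.0 - weight) * lattice_score(start))
                for start, sim in candidates]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)


# Toy data: a three-syllable query and a made-up syllable transcript.
query = ["ao", "ba", "ma"]
transcript = ["wo", "men", "ao", "ba", "ma", "de", "ao", "pa", "ma"]

candidates = locate_candidates(query, transcript)

# Hypothetical lattice scores per start position; a real system would
# derive these from word lattice posteriors.
toy_lattice = {2: 0.9, 6: 0.2}
ranked = rescore(candidates, lambda start: toy_lattice.get(start, 0.0))
print(ranked)
```

In this toy run, stage 1 keeps both the exact occurrence and a near match that differs by one syllable, and stage 2 ranks the exact occurrence first. The linear blend of the two scores is only one possible choice; the abstract does not say how the matching score is refined.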