Design of Tibetan Continuous Speech Corpus Based on Triphone

Yong Hong Li, Peng Cuo Dawa
DOI: https://doi.org/10.4028/www.scientific.net/amm.644-650.2245
2014-01-01
Applied Mechanics and Materials
Abstract: The performance of a large-vocabulary continuous speech recognition system depends largely on the quality of its speech corpus, and the selection of corpus material is the key to corpus design. Taking the Tibetan Amdo dialect of XiaHe as the research object, this paper builds a continuous speech corpus based on triphones. First, we collected a text corpus of 1000 thousand Tibetan sentences and transcribed them into IPA according to their actual pronunciation in the XiaHe dialect. We then summarized the structure of triphone junctures and used a text-processing platform to analyze statistically the combination types and frequencies of triphones in the corpus. Finally, by jointly considering the coverage rate and sparseness of triphones and class-triphones, we designed an algorithm for corpus extraction and realized automatic selection of the corpus.
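The abstract does not give the details of the extraction algorithm. As a rough illustration of the general idea of coverage-driven sentence selection, the following is a minimal sketch of a greedy selector, not the authors' method. It assumes each sentence is already a list of phone symbols, pads sentence boundaries with a hypothetical `sil` symbol, and weights each uncovered triphone by the inverse of the number of sentences that contain it so that sparse triphones are favoured. The function names `triphones` and `select_sentences` and the `budget` parameter are illustrative, and class-triphones are not modelled here.

```python
from collections import Counter


def triphones(phones):
    """Return (left, centre, right) triples for a phone sequence.

    Sentence boundaries are padded with "sil" (an assumption of this sketch).
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def select_sentences(corpus, budget):
    """Greedily choose up to `budget` sentence indices from `corpus`.

    At each step the sentence with the largest rarity-weighted count of
    not-yet-covered triphones is chosen. Returns (chosen indices, covered set).
    """
    tri_sets = [set(triphones(sentence)) for sentence in corpus]
    # Number of sentences containing each triphone; rare ones get more weight.
    freq = Counter(t for tri_set in tri_sets for t in tri_set)

    covered = set()
    chosen = []
    remaining = set(range(len(corpus)))

    def gain(i):
        # Sum of inverse sentence frequencies over triphones not yet covered.
        return sum(1.0 / freq[t] for t in tri_sets[i] - covered)

    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break  # no remaining sentence adds a new triphone
        chosen.append(best)
        covered |= tri_sets[best]
        remaining.remove(best)
    return chosen, covered
```

Calling `select_sentences(corpus, budget)` on a list of phone sequences returns the indices of the selected sentences together with the set of triphones they cover; the coverage rate is then the size of that set divided by the number of distinct triphones in the full text corpus.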