Design and research of Tibetan spoken speech corpus

Xiaohui HUANG, Jing LI, Rui MA
DOI: https://doi.org/10.3778/j.issn.1002-8331.1702-0269
2018-01-01
Abstract: Based on an analysis of traditional methods for constructing phonological corpora, and taking into account the requirements of natural spoken speech recognition and the characteristics of natural spoken Tibetan, a construction scheme and annotation standard are designed for a spoken corpus suited to Tibetan speech recognition. A 50-hour spoken corpus of Lhasa Tibetan is also constructed, with five layers of annotation: phonemes, semitones, syllables, Tibetan words, and sentences. Its statistical characteristics show that the corpus retains the natural properties of spoken language and provides balanced coverage of commonly used modeling units such as phonemes and semitones, so it can provide reliable data support for speech recognition technology based on Tibetan spoken speech data.