Tibetan-Mandarin Bilingual Speech Recognition Based on End-to-end Framework

Qingnan Wang, Wu Guo, Peixin Chen, Yan Song
DOI: https://doi.org/10.1109/apsipa.2017.8282215
2017-01-01
Abstract: This paper addresses Tibetan-Mandarin bilingual speech recognition. Because the phoneme sets of the two languages differ greatly, it is difficult to find a universal phoneme set for the bilingual acoustic model (AM) in the conventional hidden Markov model (HMM) framework. An end-to-end framework based on the connectionist temporal classification (CTC) loss function is proposed to solve this problem by using the character, rather than the phoneme, as the modeling unit. However, sparseness of the modeling units is a hard and unavoidable problem in CTC model training, particularly under low-resource conditions. This paper explores two methods to address it. First, different modeling units are selected: Tibetan characters and Mandarin non-tonal syllables are used as the CTC output units. Second, a noise-adding algorithm is applied to the bilingual part of the training corpus to augment the Mandarin speech. Experiments are carried out on the hybrid IFLYTEK Tibetan-Mandarin corpus, and clear improvements are observed with the proposed methods.
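The abstract does not specify the noise type or the signal-to-noise ratios used for augmentation. As a minimal sketch of the general idea only, the following assumes white Gaussian noise mixed into a waveform at a chosen SNR; the function names `add_noise_at_snr` and `measured_snr_db`, the SNR values, and the synthetic test tone are all hypothetical and are not taken from the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise mixed in at `snr_db`.

    Assumption: white Gaussian noise; the paper's actual noise source is not
    stated in the abstract.
    """
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # Draw unit-variance noise, then rescale it to hit the target SNR.
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Stand-in for one utterance: one second of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]

# Hypothetical SNR levels; each noisy copy would be added to the training set
# with the same transcript as the clean utterance.
augmented = [add_noise_at_snr(clean, snr, seed=i) for i, snr in enumerate((20, 10, 5))]
```

Each noisy copy keeps the transcript of its clean source, so the number of training examples per output unit grows without new annotation, which is the sense in which such augmentation can ease unit sparseness.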