Abstract: Traditional weighted finite-state transducer (WFST)-based Mongolian automatic speech recognition (ASR) systems use phonemes as the modeling units of the pronunciation lexicon. However, Mongolian is an agglutinative, low-resource language, and building an ASR system on a phoneme pronunciation lexicon remains challenging for two main reasons. First, the phoneme pronunciation lexicon manually constructed by Mongolian linguists is finite, so it is usually used to train a grapheme-to-phoneme (G2P) conversion model that extends the lexicon to new words. However, data sparsity reduces the robustness of the G2P model and degrades the performance of the final ASR system. Second, homophones and polysyllabic words are common in Mongolian, which complicates the construction of the Mongolian acoustic model. To address these problems, we first propose a grapheme-to-phoneme alignment model to obtain the mapping between phonemes and subword units. Then, we construct an acoustic subword segmentation set and use it to segment words directly, instead of using the traditional G2P method to predict phoneme sequences when expanding the pronunciation lexicon. Furthermore, by analyzing the Mongolian encoding form, we propose a method for constructing acoustic subword modeling units that removes control characters. Finally, we investigate various acoustic subword modeling units for constructing the pronunciation lexicon of the Mongolian ASR system. Experiments on a Mongolian dataset with 325 hours of training data show that a pronunciation lexicon based on acoustic subword modeling units can be used effectively to build a WFST-based Mongolian ASR system, and that removing control characters when building the acoustic subword modeling units further improves ASR performance.
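As a rough illustration of the lexicon-expansion idea described in the abstract, the sketch below segments a word directly into subword units after stripping control characters. It is not the authors' implementation: the greedy longest-match strategy, the function names (`strip_control_chars`, `segment`, `build_lexicon`), the toy Latin-script unit set, and the particular list of control code points are all assumptions made for illustration. In the paper, the segmentation set is derived from a grapheme-to-phoneme alignment model, which is not reproduced here.

```python
# Assumed set of format/control code points that can occur in Mongolian text.
# The exact set used in the paper is not given in the abstract.
CONTROL_CHARS = {
    "\u180b", "\u180c", "\u180d",  # Mongolian free variation selectors 1-3
    "\u180e",                      # Mongolian vowel separator
    "\u200c", "\u200d",            # zero-width non-joiner / joiner
}


def strip_control_chars(word):
    """Return the word with the assumed control characters removed."""
    return "".join(ch for ch in word if ch not in CONTROL_CHARS)


def segment(word, units):
    """Greedy longest-match segmentation into subword units.

    Characters not covered by any unit are emitted as single-character pieces.
    """
    max_len = max(map(len, units), default=1)
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + n]
            if n == 1 or candidate in units:
                pieces.append(candidate)
                i += n
                break
    return pieces


def build_lexicon(words, units, remove_control=True):
    """Map each word to its subword sequence (one lexicon entry per word)."""
    lexicon = {}
    for word in words:
        cleaned = strip_control_chars(word) if remove_control else word
        lexicon[word] = segment(cleaned, units)
    return lexicon


# Toy example with a hypothetical Latin-script unit set.
toy_units = {"sa", "in", "ba"}
toy_lexicon = build_lexicon(["sainbaina", "sa\u180bin"], toy_units)
```

With `remove_control=True`, a word containing a variation selector maps to the same subword sequence as its plain counterpart, which is the kind of unit-inventory reduction the abstract attributes to removing control characters.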
