Abstract: This paper compares the performance of different acoustic modeling units in deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) systems for Chinese. Recently, DNN-based acoustic modeling has achieved very competitive performance on many speech recognition tasks and has become the focus of current LVCSR research. Some previous work has studied context-independent and context-dependent DNN-based acoustic models. For Chinese, a syllabic language, the choice of basic modeling unit in DNN-based LVCSR systems is a very important issue. Three basic modeling units (syllables, initials/finals, and phones) are discussed and compared. Experimental results show that, in the DNN-based systems, context-dependent (CD) phones obtain the best performance, and context-independent (CI) syllables perform similarly to CD initials/finals. How the number of clustered states affects the performance of DNN-based systems is also discussed; the behavior differs from that of GMM-based systems. In addition, by introducing a multi-task learning strategy, these multiple modeling units can be combined in the DNN training procedure. The experimental results indicate that combining the modeling units through multi-task learning outperforms each individual modeling unit.
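The multi-task combination mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the network layout, so the code below assumes one common arrangement: a shared hidden layer feeding a separate softmax output layer per modeling unit, trained on a weighted sum of per-task cross-entropy losses. The layer sizes, class counts, task weights, and all function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def multitask_loss(x, targets, shared, heads, task_weights):
    """Weighted sum of per-task cross-entropy losses over a shared hidden layer."""
    # Shared representation (a single ReLU layer here; an assumption).
    hidden = [max(0.0, h) for h in linear(x, *shared)]
    total = 0.0
    per_task = {}
    for task, (weights, bias) in heads.items():
        probs = softmax(linear(hidden, weights, bias))
        per_task[task] = -math.log(probs[targets[task]])
        total += task_weights[task] * per_task[task]
    return total, per_task

rng = random.Random(0)
feat_dim, hidden_dim = 8, 16
# Hypothetical output sizes; real systems use far larger inventories.
num_classes = {"syllable": 12, "initial_final": 10, "phone": 6}
shared = init_layer(hidden_dim, feat_dim, rng)
heads = {task: init_layer(n, hidden_dim, rng) for task, n in num_classes.items()}
# Hypothetical task weights; the paper's weighting is not given in the abstract.
task_weights = {"syllable": 0.3, "initial_final": 0.3, "phone": 0.4}

# One acoustic feature frame with a target label for each modeling unit.
frame = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
targets = {"syllable": 3, "initial_final": 7, "phone": 2}
total_loss, per_task_loss = multitask_loss(frame, targets, shared, heads, task_weights)
```

In a setup like this, gradients from all three output layers flow into the shared layer during training, which is one way the units could inform each other. At decoding time, typically only one output layer would be used.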


A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition
