Abstract: This paper compares the performance of different acoustic modeling units in deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) systems for Chinese. Recently, DNN-based acoustic modeling has achieved very competitive performance on many speech recognition tasks and has become the focus of current LVCSR research. Some previous work has studied context-independent and context-dependent DNN-based acoustic models. For Chinese, a syllabic language, the choice of basic modeling units in DNN-based LVCSR systems is an important issue. Three basic modeling units, syllables, initials/finals, and phones, are discussed and compared. Experimental results show that, in the DNN-based systems, context-dependent (CD) phones obtain the best performance, and context-independent (CI) syllables perform similarly to CD initials/finals. How the number of clustered states affects the performance of DNN-based systems is also discussed; the behavior differs from that of Gaussian mixture model (GMM) based systems. In addition, by introducing a multi-task learning strategy, these multiple modeling units can be combined in the DNN training procedure. The experimental results indicate that combining the modeling units through multi-task learning outperforms each individual modeling unit.
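
The multi-task idea mentioned in the abstract can be illustrated with a minimal sketch: one shared hidden layer feeds a separate softmax output layer per modeling unit, and the training objective is a weighted sum of the per-task cross-entropies. The abstract does not give the network topology, the output sizes, or the task weights, so the class `MultiTaskDNN`, the function `multitask_loss`, the task names, and all numbers below are hypothetical and chosen only for illustration. Only the forward pass and the loss are shown; backpropagation is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskDNN:
    """Hypothetical network: shared hidden layer, one softmax head per unit type."""

    def __init__(self, n_in, n_hidden, task_sizes, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(rng, n_hidden, n_in)
        self.heads = {name: init_layer(rng, size, n_hidden)
                      for name, size in task_sizes.items()}

    def forward(self, x):
        # Shared representation (ReLU), then task-specific posteriors.
        h = [max(0.0, z) for z in linear(*self.hidden, x)]
        return {name: softmax(linear(w, b, h))
                for name, (w, b) in self.heads.items()}


def multitask_loss(posteriors, targets, task_weights):
    """Weighted sum of per-task cross-entropies for one frame."""
    return sum(task_weights[t] * -math.log(posteriors[t][targets[t]])
               for t in targets)


# Toy output sizes; real systems use thousands of tied states.
model = MultiTaskDNN(n_in=4, n_hidden=8,
                     task_sizes={"cd_phone": 6, "ci_syllable": 5,
                                 "cd_initial_final": 4})
frame = [0.2, -0.1, 0.4, 0.0]  # made-up acoustic feature vector
post = model.forward(frame)
loss = multitask_loss(
    post,
    targets={"cd_phone": 2, "ci_syllable": 1, "cd_initial_final": 3},
    task_weights={"cd_phone": 1.0, "ci_syllable": 0.5,
                  "cd_initial_final": 0.5},
)
```

In a setup like this, one head would typically be treated as the primary task used for decoding, while the others act as auxiliary targets that regularise the shared layers; whether the paper follows that arrangement is not stated in the abstract.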

Context Dependent Initial/final Acoustic Modeling for Continuous Chinese Speech Recognition

Improved context-dependent acoustic modeling for continuous Chinese speech recognition

Initial/final acoustic model based on separating nasal coda in Chinese Putonghua speech recognition

Context Dependent Syllable Acoustic Model For Continuous Chinese Speech Recognition

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

Acoustic Modeling Based On Chinese Phonetics Knowledge

Research on Inter-Syllable Context-Dependent Acoustic Unit for Mandarin Continuous Speech Recognition.

The Definition and Extension of the Question Set for Decision Tree Based State Tying in Chinese Speech Recognition

A New Acoustic Modeling of Inter-Syllable Context-Dependent Units for Putonghua Continuous Speech Recognition

Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition

Modeling Pronunciation Variation Using Context-Dependent Weighting and B/s Refined Acoustic Modeling.

Deep neural networks for syllable based acoustic modeling in Chinese speech recognition.

English Alphabet Recognition Based on Chinese Acoustic Modeling

INTRA-SYLLABLE DEPENDENT PHONETIC MODELING FOR CHINESE SPEECH RECOGNITION

Automatic Initial/Final Generation For Dialectal Chinese Speech Recognition

A comparable study of modeling units for end-to-end Mandarin speech recognition

MANDARIN PRONUNCIATION VARIATION MODELING

Lightly supervised acoustic model training for mandarin continuous speech recognition

PHMM Based Asynchronous Acoustic Model for Chinese Large Vocabulary Continuous Speech Recognition

Mandarin Pronunciation Modeling Based on CASS Corpus.