Abstract: The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. Recent connectionist temporal classification (CTC) based acoustic models allow more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable units, we also propose hybrid Character-Syllable modeling units that mix high-frequency Chinese characters with syllables. Experimental results show that DFSMN-CTC-sMBR models with all of these types of modeling units significantly outperform well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units are the best choice for CTC-based acoustic modeling for Mandarin speech recognition in our work, since they dramatically reduce substitution errors in the recognition results. In a 20,000-hour Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable units achieves a character error rate (CER) of 7.45%, while the well-trained DFSMN-CE-sMBR system achieves 9.49%.
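
The idea of mixing high-frequency characters with syllables can be illustrated with a minimal sketch. The abstract does not specify how the unit inventory is built, so everything below is an assumption made for illustration: the function names `build_frequent_chars` and `to_hybrid_units`, the frequency cutoff `top_k`, and the one-pronunciation-per-character lexicon `char_to_syllable` (which ignores polyphonic characters) are all hypothetical and not taken from the paper.

```python
from collections import Counter


def build_frequent_chars(transcripts, top_k):
    """Return the top_k most frequent characters in the training transcripts.

    Assumption: "high frequency" is decided by a simple count cutoff.
    """
    counts = Counter(ch for text in transcripts for ch in text)
    return {ch for ch, _ in counts.most_common(top_k)}


def to_hybrid_units(text, frequent_chars, char_to_syllable):
    """Map a transcript to a hybrid Character-Syllable label sequence.

    Frequent characters are kept as character units; all other characters
    fall back to their syllable from a (hypothetical) pronunciation lexicon.
    """
    units = []
    for ch in text:
        if ch in frequent_chars:
            units.append(ch)  # character-level unit
        else:
            units.append(char_to_syllable[ch])  # syllable-level unit
    return units
```

Under these assumptions, a rare character shares its label with every other character pronounced the same way, while a frequent character gets a label of its own.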

Using Different Models to Label the Break Indices for Mandarin Speech Synthesis

Comparison of Approaches for Predicting Break Indices in Mandarin Speech Synthesis

Assigning Break Indices for Unrestricted Texts in Mandarin Text to Speech System

Automatic Phrase Breaks Prediction in Chinese Sentences

A Novel Hybrid Mandarin Speech Synthesis System Using Different Base Units for Model Training and Concatenation

A Hierarchical Viterbi Algorithm For Mandarin Hybrid Speech Synthesis System

Prosodic Word Boundaries Prediction for Mandarin Text-to-Speech

Mandarin Stress Analysis And Prediction For Speech Synthesis

Hierarchical Stress Modeling in Mandarin Text-to-Speech

Learning Prosodic Patterns for Mandarin Speech Synthesis

Mandarin Pronunciation Modeling Based on CASS Corpus

A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin

Comparison of Syllable/Phone HMM Based Mandarin TTS

Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition

Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Automatic Phrase Boundary Labeling for a Mandarin TTS Corpus Using the Viterbi Decoding Algorithm

A Novel Hybrid Approach for Mandarin Speech Synthesis

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Syllable HMM Based Mandarin TTS and Comparison with Concatenative TTS

The Pause Duration Prediction for Mandarin Text-to-speech System