Abstract: The hybrid of a deep neural network (DNN) and a hidden Markov model (HMM) has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its training is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition in which the posterior probabilities of HMM states are computed from multiple DNNs (mDNN) instead of a single large DNN, so that training can be parallelized for faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters using data-driven methods. Next, several hierarchically structured DNNs are trained separately and in parallel for these clusters on multiple computing units (e.g., GPUs). In decoding, the posterior probabilities of HMM states are calculated by combining the outputs of the multiple DNNs. We show that training of the mDNN under popular criteria, including both frame-level cross-entropy (CE) and sequence-level discriminative training, can be parallelized efficiently to yield a significant speedup. The speedup is mainly attributed to two factors: the multiple DNNs are trained in parallel over multiple GPUs, and each DNN is smaller and is trained on only a subset of the training data. We evaluate the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to a conventional DNN, a 4-cluster mDNN model of similar size yields comparable recognition performance on Switchboard (only about 2% performance degradation), with a more than 7-fold speedup in CE training and a 2.9-fold speedup in sequence training when 4 GPUs are used.
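
The abstract does not spell out how the outputs of the multiple DNNs are combined in decoding. One plausible reading of "hierarchically structured" is a two-level factorization, p(s | x) = p(c | x) · p(s | c, x), where c is the cluster containing state s. The sketch below illustrates that reading only; the factorization, the function names, and the toy numbers are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(cluster_logits, state_logits_per_cluster, cluster_states):
    """Combine per-cluster outputs into one posterior over all tied states.

    Assumed (hypothetical) scheme: p(s | x) = p(c | x) * p(s | c, x).
    cluster_logits: one score per cluster for the current frame.
    state_logits_per_cluster: for each cluster, scores for its own states.
    cluster_states: for each cluster, the global ids of its states
                    (clusters are disjoint, as in the abstract).
    """
    cluster_post = softmax(cluster_logits)
    posteriors = {}
    for c, (logits, states) in enumerate(
        zip(state_logits_per_cluster, cluster_states)
    ):
        within = softmax(logits)  # p(s | c, x) from the c-th small DNN
        for state_id, p in zip(states, within):
            posteriors[state_id] = cluster_post[c] * p
    return posteriors


# Toy frame: 5 tied states split into 2 disjoint clusters (made-up numbers).
cluster_states = [[0, 3], [1, 2, 4]]
cluster_logits = [0.2, -0.1]
state_logits = [[1.0, 0.5], [0.0, 0.3, -0.2]]
post = combine_posteriors(cluster_logits, state_logits, cluster_states)
```

Because each per-cluster softmax sums to one and the cluster posteriors also sum to one, the combined values form a proper distribution over all tied states. Under this assumed scheme, each small DNN only needs the frames aligned to its own cluster's states during training, which is consistent with the abstract's remark that each DNN is trained on a subset of the data.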

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition
