Abstract: The hybrid of deep neural networks (DNN) and hidden Markov models (HMM) has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its training is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition in which the posterior probabilities of HMM states are computed from multiple DNNs (mDNN) instead of a single large DNN, so that training can be parallelized for faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters using data-driven methods. Next, several hierarchically structured DNNs are trained separately and in parallel for these clusters on multiple computing units (e.g., GPUs). In decoding, the posterior probabilities of HMM states are calculated by combining the outputs of the multiple DNNs. We show that mDNN training under popular criteria, including both frame-level cross-entropy (CE) and sequence-level discriminative training, can be parallelized efficiently to yield a significant speedup. The speedup comes mainly from two factors: the multiple DNNs are trained in parallel over multiple GPUs, and each DNN is smaller and is trained on only a subset of the training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to the conventional DNN, a 4-cluster mDNN model of similar size yields comparable recognition performance on Switchboard (only about 2% performance degradation) while training more than 7 times faster under the CE criterion and 2.9 times faster under sequence training, when 4 GPUs are used.
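The decoding step described above, combining outputs from multiple DNNs into HMM state posteriors, can be illustrated with a small sketch. The abstract does not spell out the combination rule, so the factorization used here is an assumption: a top-level network is taken to give a posterior over clusters, each cluster network a posterior over the states within its cluster, and the state posterior is their product, P(s | x) = P(c | x) · P(s | c, x). The function names, the toy logits, and the cluster layout are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(cluster_logits, per_cluster_logits, cluster_states):
    """Combine hierarchical outputs into per-state posteriors for one frame.

    Assumed (not taken from the abstract) factorization:
        P(state | x) = P(cluster | x) * P(state | cluster, x)

    cluster_logits:      output of a hypothetical top-level network, one per cluster
    per_cluster_logits:  per_cluster_logits[c] is the output of cluster c's network
    cluster_states:      cluster_states[c] lists the tied-state ids in cluster c
                         (clusters are disjoint, as in the abstract)
    """
    cluster_post = softmax(cluster_logits)
    posteriors = {}
    for c, p_cluster in enumerate(cluster_post):
        within = softmax(per_cluster_logits[c])
        for state, p_state in zip(cluster_states[c], within):
            posteriors[state] = p_cluster * p_state
    return posteriors


# Toy example: 5 tied states split into 2 disjoint clusters.
cluster_states = [[0, 1], [2, 3, 4]]
cluster_logits = [0.2, -0.1]
per_cluster_logits = [[1.0, 0.5], [0.3, 0.3, -1.0]]

post = combine_posteriors(cluster_logits, per_cluster_logits, cluster_states)
for state in sorted(post):
    print(state, round(post[state], 4))
```

Under this assumed factorization the combined values form a proper distribution over all tied states, and each cluster network only needs an output layer as large as its own cluster, which is consistent with the abstract's point that each DNN is smaller than a single large model.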
