Abstract: The hybrid deep neural network (DNN) and hidden Markov model (HMM) approach has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its learning process is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition in which the posterior probabilities of HMM states are computed from multiple DNNs (mDNN) instead of a single large DNN, so that training can be parallelized for faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters using data-driven methods. Next, several hierarchically structured DNNs are trained separately and in parallel for these clusters on multiple computing units (e.g. GPUs). In decoding, the posterior probabilities of HMM states are calculated by combining the outputs of the multiple DNNs. We show that the training procedure of the mDNN under popular criteria, including both frame-level cross-entropy (CE) and sequence-level discriminative training, can be parallelized efficiently to yield a significant speedup. The speedup is mainly attributed to two facts: the multiple DNNs are trained in parallel over multiple GPUs, and each DNN is smaller and is trained on only a subset of the training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to the conventional DNN, a 4-cluster mDNN model of similar size yields comparable recognition performance on Switchboard (only about 2% performance degradation), with a more than 7 times speedup in CE training and a 2.9 times speedup in sequence training when 4 GPUs are used.
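The abstract does not state the exact rule for combining the outputs of the multiple DNNs. One plausible reading of the hierarchical structure, given here only as an assumption, is that a top-level network estimates a posterior over state clusters and each cluster-specific network estimates a posterior over the states in its own cluster, so that p(s | x) = p(c(s) | x) · p(s | c(s), x), where c(s) is the cluster containing state s. The sketch below illustrates that factorization for one frame; the function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(cluster_logits, state_logits_per_cluster):
    """Combine hierarchical outputs into one posterior over all states.

    Assumed factorization (not confirmed by the abstract):
        p(state | x) = p(cluster | x) * p(state | cluster, x)

    cluster_logits: top-level outputs, one value per cluster.
    state_logits_per_cluster: one list of outputs per cluster-specific
        network, covering only the states assigned to that cluster.
    Returns the state posteriors ordered cluster by cluster.
    """
    cluster_post = softmax(cluster_logits)
    combined = []
    for p_cluster, state_logits in zip(cluster_post, state_logits_per_cluster):
        within_cluster = softmax(state_logits)
        combined.extend(p_cluster * p for p in within_cluster)
    return combined


# Toy frame: 2 clusters holding 3 and 2 tied states (made-up numbers).
cluster_logits = [0.2, -0.1]
state_logits_per_cluster = [[1.0, 0.5, -0.3], [0.0, 2.0]]
posteriors = combine_posteriors(cluster_logits, state_logits_per_cluster)
```

Because the clusters are disjoint and each softmax sums to one, the combined values form a valid distribution over all tied states under this assumption, which is what a decoder expecting single-DNN posteriors would need.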

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition

An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

A Fusion Approach to Spoken Language Identification Based on Combining Multiple Phone Recognizers and Speech Attribute Detectors

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

Multi-feature Combination for Speaker Recognition

English speech recognition based on deep learning with multiple features

Speech recognition method based on DNN-LSTM combined with Wiener filtering algorithm

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Deep joint learning for language recognition

An HMM/MFNN Hybrid Architecture Based on Stacked Generalization for Speaker Identification

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Fusing audio and visual features of speech

Fusion of deep shallow features and models for speaker recognition

Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition

Hybrid speech recognition based on improved hidden markov model and neural network

An Investigation into Using Parallel Data for Far-Field Speech Recognition

Joint Training Of Front-End And Back-End Deep Neural Networks For Robust Speech Recognition