Abstract: State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method, which integrates architecture learning with lattice-free maximum mutual information (LF-MMI) training; the Gumbel-Softmax and pipelined DARTS methods, which reduce the confusion over candidate architectures and improve the generalization of architecture selection; and Penalized DARTS, which incorporates resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to $7^{28}$ different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and a relative model size reduction of 31% were obtained over a state-of-the-art baseline LF-MMI TDNN-F system trained on the 300-hour Switchboard corpus, featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation, as well as RNNLM rescoring. Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest that the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5'00 and Rt03s test sets respectively, with up to 96% model size reduction. Further analysis using Bayesian learning shows that the proposed NAS approaches can effectively minimize the structural redundancy in the TDNN-F systems and reduce their model parameter uncertainty. Consistent performance improvements were also obtained on a UASpeech dysarthric speech recognition task.
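
As an illustration of two ideas named in the abstract, the following minimal sketch shows how a Gumbel-Softmax sample can sharpen the selection weights over candidate architectures, and how a resource penalty can be added to a training loss in the spirit of Penalized DARTS. It is not the authors' implementation: the function names, the seven candidate bottleneck dimensions, the parameter counts, the loss value and the penalty coefficient `eta` are all hypothetical, and the LF-MMI criterion is replaced by a placeholder scalar.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of architecture logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Draw relaxed one-hot selection weights.

    Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1). A lower
    temperature pushes the weights closer to a one-hot choice.
    """
    noisy = [
        x - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
        for x in logits
    ]
    return softmax(noisy, temperature)


def penalized_loss(task_loss, weights, param_counts, eta):
    """Add the expected parameter count, scaled by eta, to the task loss."""
    expected_params = sum(w * c for w, c in zip(weights, param_counts))
    return task_loss + eta * expected_params


# Hypothetical search space for one layer: seven candidate bottleneck sizes.
candidate_dims = [32, 64, 96, 128, 160, 192, 256]
# Hypothetical cost of each candidate (projection in and out of a 1536-dim layer).
param_counts = [2 * 1536 * d for d in candidate_dims]
# Architecture logits would be learned jointly with the model; zeros here.
arch_logits = [0.0] * len(candidate_dims)

weights = gumbel_softmax(arch_logits, temperature=0.5, rng=random.Random(0))
loss = penalized_loss(task_loss=1.0, weights=weights,
                      param_counts=param_counts, eta=1e-7)
selected_dim = candidate_dims[weights.index(max(weights))]
```

In this sketch, a larger `eta` makes large bottleneck candidates more expensive, so gradient-based updates of the architecture logits would be steered toward smaller models; how the trade-off is actually set in the paper is not stated in the abstract.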
