Abstract: Understanding architectural choices for deep neural networks (DNNs) is crucial to improving state-of-the-art speech recognition systems. We investigate which aspects of DNN acoustic model design are most important for speech recognition system performance, focusing on feed-forward networks. We study the effects of parameters like model size (number of layers, total parameters), architecture (convolutional networks), and training details (loss function, regularization methods) on DNN classifier performance and speech recognizer word error rates. On the Switchboard benchmark corpus we compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. Using a much larger 2100-hour training corpus (combining Switchboard and Fisher) we examine the performance of very large DNN models – with up to ten times more parameters than those typically used in speech recognition systems. The results suggest that a relatively simple DNN architecture and optimization technique give strong performance, and we offer intuitions about architectural choices like network depth over breadth. Our findings extend previous works to help establish a set of best practices for building DNN hybrid speech recognition systems and constitute an important first step toward analyzing more complex recurrent, sequence-discriminative, and HMM-free architectures.

Convolutional maxout neural networks for low-resource speech recognition

Deep Maxout Neural Networks for Speech Recognition

Maxout Neurons for Deep Convolutional and LSTM Neural Networks in Speech Recognition

Stochastic Pooling Maxout Networks for Low-Resource Speech Recognition

Convolutional Maxout Neural Networks for Speech Separation

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

Very Deep Convolutional Neural Networks for Robust Speech Recognition

Cross-Lingual and Ensemble MLPs Strategies for Low-Resource Speech Recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Articulatory Feature Based Multilingual MLPs for Low-Resource Speech Recognition

Multi-scale Feature Based Convolutional Neural Networks for Large Vocabulary Speech Recognition

MLP-HMM Two-Stage Unsupervised Training for Low-Resource Languages on Conversational Telephone Speech Recognition

Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network

Long short term memory recurrent neural network acoustic models using i-vector for low resource speech recognition

Building DNN acoustic models for large vocabulary speech recognition

Research Progress on Key Technologies of Low Resource Speech Recognition

Improving deep neural networks for LVCSR using dropout and shrinking structure

Structure Growth for Small-Footprint Speech Recognition

Speech Recognition using Convolution Deep Neural Networks