Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. By leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments on the REVERB real test data.
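
For readers unfamiliar with the metric, the WER figures quoted in the abstract are assumed here to follow the conventional definition: the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words and expressed as a percentage. The following is a minimal sketch of that computation, not the scoring tool used in the REVERB Challenge; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent (illustrative helper).

    Assumes whitespace-separated words and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words. Because insertions are counted, the value can exceed 100 when the hypothesis is much longer than the reference.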
