Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural network based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
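
The results above are stated as word error rates. As a reminder of how that metric is conventionally defined, the following is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, expressed in percent. The function name `word_error_rate` and the example sentences are illustrative assumptions and do not come from the paper; the scoring tools used for the REVERB Challenge may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent between a reference and a hypothesis transcript.

    Illustrative helper (not from the paper): whitespace tokenization,
    no text normalization, uniform cost of 1 per error.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.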

A Maximum Likelihood Approach to Deep Neural Network Based Speech Dereverberation

A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation

Gaussian Density Guided Deep Neural Network For Single-Channel Speech Enhancement

Error Modeling Via Asymmetric Laplace Distribution for Deep Neural Network Based Single-Channel Speech Enhancement

Multichannel Linear Prediction-Based Speech Dereverberation Considering Sparse and Low-Rank Priors

Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep Learning-Based Speech Enhancement

A regression approach to speech enhancement based on deep neural networks

A Maximum Likelihood Approach to Multi-Objective Learning Using Generalized Gaussian Distributions for DNN-Based Speech Enhancement

Monaural Speech Dereverberation using Deformable Convolutional Networks

A Research to Speech Dereverberation Method Based on BLSTM Recurrent Neural Networks and Non-negative Matrix Factorization

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Speech Dereverberation Based on Sparse Matrix Decomposition

A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement

Multi-task single channel speech enhancement using speech presence probability as a secondary task training target

Deep Neural Network-Based Bottleneck Feature and Denoising Autoencoder-Based Dereverberation for Distant-Talking Speaker Identification

Joint Training of DNNs by Incorporating an Explicit Dereverberation Structure for Distant Speech Recognition

Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations

On phase recovery and preserving early reflections for deep-learning speech dereverberation

A DNN Parameter Mask for the Binaural Reverberant Speech Segregation