Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that can handle a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating our proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
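
The results above are stated as word error rates. As a reminder of how that metric is conventionally defined (substitutions, deletions, and insertions divided by the number of reference words), here is a minimal Python sketch. It is not the REVERB Challenge scoring tool: the function name `word_error_rate` is hypothetical, and whitespace tokenization with no text normalization is an assumption made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N.

    Assumption: words are separated by whitespace and compared exactly;
    real scoring tools usually apply text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For instance, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.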

Monaural Speech Dereverberation using Deformable Convolutional Networks

A Deep Proximal-Unfolding Method for Monaural Speech Dereverberation

A Fast Convolutional Self-Attention Based Speech Dereverberation Method For Robust Speech Recognition

Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Multi-resolution Convolutional Residual Neural Networks for Monaural Speech Dereverberation

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

Speech enhancement with frequency domain auto-regressive modeling

On phase recovery and preserving early reflections for deep-learning speech dereverberation

Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

Supervised Single-Channel Speech Dereverberation and Denoising Using a Two-Stage Model Based Sparse Representation

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

A reverberation-time-aware DNN approach leveraging spatial information for microphone array dereverberation

Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm for Hands-free Speech Recognition

Supervised Single-Channel Speech Dereverberation And Denoising Using A Two-Stage Processing

Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization

A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Speech Dereverberation for Enhancement and Recognition Using Dynamic Features Constrained Deep Neural Networks and Feature Adaptation

Deep Learning Applied to Dereverberation and Sound Event Classification in Reverberant Environments

Phase and Reverberation Aware DNN for Distant-Talking Speech Enhancement