Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. The framework relies on two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task serves as the test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Using joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments with the REVERB real test data.
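The results above are stated as word error rates. As a reminder of how that figure is usually defined, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration only; the abstract does not describe the scoring tool used in the REVERB evaluation, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N, with N reference words.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption and not the REVERB scoring procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 3.76% corresponds to fewer than four word errors per hundred reference words.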

Replacing Uncertainty Decoding with Subband Re-Estimation for Large Vocabulary Speech Recognition in Noise

Combining Noise Compensation and Missing-Feature Decoding for Large Vocabulary Speech Recognition in Noise

Adapting noisy speech models — Extended uncertainty decoding

A Comparative Study of Noise Estimation Algorithms for Nonlinear Compensation in Robust Speech Recognition

An Algorithm of Model Compensation Based on the Estimation of Additive Noise and Channel Function for Speech Recognition

A Feature Compensation Approach Using Piecewise Linear Approximation of an Explicit Distortion Model for Noisy Speech Recognition

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Improving Short Utterance Speaker Recognition by Modeling Speech Unit Classes

Nonlinear Regularization Decoding Method for Speech Recognition

A Feature Compensation Approach Using High-Order Vector Taylor Series Approximation of an Explicit Distortion Model for Noisy Speech Recognition

Very Deep Convolutional Neural Networks for Robust Speech Recognition

Residual Noise Compensation For Robust Speech Recognition In Nonstationary Noise

Noise Robust Speech Recognition on Aurora4 by Humans and Machines

An Improved VTS Feature Compensation Using Mixture Models of Distortion and IVN Training for Noisy Speech Recognition

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Adaptive Compensation Algorithm in Open Vocabulary Mandarin Speaker-Independent Speech Recognition

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

Integrating Lattice-Free MMI into End-to-End Speech Recognition

An efficient joint training model for monaural noisy-reverberant speech recognition

Improving Uyghur ASR systems with decoders using morpheme-based language models

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition