Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of the front-end speech signal processing and the back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that can handle a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating our proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. By leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
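
The results above are stated as word error rates. As a reminder of what that metric measures, the following is a minimal sketch that assumes the conventional definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, expressed as a percentage. The abstract does not name the scoring tool used for the REVERB evaluation, so the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the authors' scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, assuming whitespace-tokenized transcripts.

    Hypothetical helper for illustration only; it is not the scoring
    tool used in the REVERB Challenge.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 3.76% corresponds to roughly 4 word errors per 100 reference words.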

Speech Selection and Environmental Adaptation for Asynchronous Speech Recognition

Deep Neural Network-Based Bottleneck Feature and Denoising Autoencoder-Based Dereverberation for Distant-Talking Speaker Identification

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments

Joint sparse representation based cepstral-domain dereverberation for distant-talking speech recognition

On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Joint Training of DNNs by Incorporating an Explicit Dereverberation Structure for Distant Speech Recognition

Domain Adaptation with Augmented Data by Deep Neural Network Based Method Using Re-Recorded Speech for Automatic Speech Recognition in Real Environment

Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

Online Speaker Adaptation Using Memory-Aware Networks for Speech Recognition

Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

Dereverberation Based on Generalized Spectral Subtraction for Distant-Talking Speaker Recognition

A New Real-Time Noise Suppression Algorithm for Far-Field Speech Communication Based on Recurrent Neural Network