Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm that jointly learns the front-end speech signal processing and the back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
