Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating our proposed framework. We first report objective measures for enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
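
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the abstract does not describe the scoring tool actually used for the REVERB evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenized transcripts.

    Illustrative sketch only: tokenization and normalization are assumptions,
    not the scoring procedure used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, a WER can exceed 100% for very poor hypotheses.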

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Joint Training Of Front-End And Back-End Deep Neural Networks For Robust Speech Recognition

End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

Double Branches and Stages Neural Network for Joint Acoustic Echo and Noise Suppression

Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Joint Speech Activity and Overlap Detection with Multi-Exit Architecture

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

Cascaded encoders for fine-tuning ASR models on overlapped speech

An efficient joint training model for monaural noisy-reverberant speech recognition

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Multi-Objective Learning and Mask-Based Post-Processing for Deep Neural Network Based Speech Enhancement

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Multichannel Speech Enhancement without Beamforming

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Mixed-Bandwidth Cross-Channel Speech Recognition Via Joint Optimization of DNN-Based Bandwidth Expansion and Acoustic Modeling