Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration, exploiting data acquired and processed with multichannel microphone arrays, to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech that are superior to those listed in the 2014 REVERB Challenge Workshop on the simulated data test set. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments on the REVERB real test data.
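The results above are stated as word error rates. As a reminder of what that metric measures, the following is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words, expressed as a percentage). It is not the REVERB Challenge scoring tool; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference length.

    Illustrative sketch only; tokenization is plain whitespace splitting (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # (before the current word) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))  # 25.0
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.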

Acoustic Modeling for Multi-Array Conversational Speech Recognition in the Chime-6 Challenge

Space-and-speaker-aware Acoustic Modeling with Effective Data Augmentation for Recognition of Multi-Array Conversational Speech

Acoustic Model Ensembling Using Effective Data Augmentation for CHiME-5 Challenge

Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments

3-D Feature and Acoustic Modeling for Far-Field Speech Recognition

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Enhancing CTC-based speech recognition with diverse modeling units

Acoustic Model Fusion for End-to-end Speech Recognition

Multi-Span Acoustic Modelling Using Raw Waveform Signals

A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge