Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments with the REVERB real test data.
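The WER figures quoted in the abstract are conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words and expressed in percent. The following is a minimal sketch of that standard definition, not the REVERB Challenge scoring tool; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (substitutions + deletions + insertions) / N.

    Illustrative helper (an assumption, not the official scoring script).
    N is the number of words in the reference transcript.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Corpus-level WER is usually obtained by summing the edit operations over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.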

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition
