Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
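
The results above are stated as word error rates. As a reminder of how that metric is conventionally defined, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, expressed as a percentage. It is a minimal illustration only: the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and this is not the scoring tool used in the REVERB Challenge.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N, N = reference word count.

    Illustrative sketch; assumes plain whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.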

Reverb: Open-Source ASR and Diarization from Rev

S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations

Open Source Automatic Speech Recognition for German

Exploring the limits of decoder-only models trained on public speech recognition corpora

Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Vakyansh: ASR Toolkit for Low Resource Indic languages

RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Svarah: Evaluating English ASR Systems on Indian Accents

ViSpeR: Multilingual Audio-Visual Speech Recognition

An open-source voice type classifier for child-centered daylong recordings

OpenVoice: Versatile Instant Voice Cloning

Reverb Conversion of Mixed Vocal Tracks Using an End-to-end Convolutional Deep Neural Network

The VoxCeleb Speaker Recognition Challenge: A Retrospective

Updated Corpora and Benchmarks for Long-Form Speech Recognition

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion