Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments on the REVERB real test data.
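
For readers unfamiliar with the metric quoted above, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words and expressed as a percentage. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the example sentences are illustrative assumptions and are not taken from the REVERB scoring setup, whose text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumption: transcripts are split on whitespace with no further
    normalization; real scoring tools may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 3.76% corresponds to roughly 38 word errors per 1000 reference words.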

Applicability of End-to-End Deep Neural Architecture to Sinhala Speech Recognition

Bidirectional RNN for Audio Deep Learning in an End-to-End Model

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Evaluation of Noise Reduction Methods for Sentence Recognition by Sinhala Speaking Listeners

Speech recognition with deep recurrent neural networks

Deep learning based large vocabulary continuous speech recognition of an under-resourced language Bangladeshi Bangla

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Deep Learning Speech Synthesis Model for Word/Character-Level Recognition in the Tamil Language

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

Acoustic Model Fusion for End-to-end Speech Recognition

Real-time translation of discrete Sinhala speech to Unicode text

Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet

Development of an End-to-End Deep Learning Framework for Sign Language Recognition, Translation, and Video Generation

Lip Synchronization Model For Sinhala Language Using Machine Learning

Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

End-to-End Architectures for Speech Recognition

Deep LSTM for Large Vocabulary Continuous Speech Recognition