Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that can handle a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures for enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report preliminary yet promising experiments with the REVERB real test data.
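
For readers unfamiliar with the metric, the WER figures quoted in the abstract are conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal Python sketch of that standard definition; it is not the REVERB Challenge scoring tool, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Illustrative helper (hypothetical name), not an official scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn the lights off", "turn the light off"))
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.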

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

Accent Recognition with Hybrid Phonetic Features

E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

Investigation of Deep Neural Network Acoustic Modelling Approaches for Low Resource Accented Mandarin Speech Recognition

Improved BLSTM RNN Based Accent Speech Recognition Using Multi-task Learning and Accent Embeddings

Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

End-to-End Multi-Accent Speech Recognition with Unsupervised Accent Modelling

Leveraging native language information for improved accented speech recognition

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

Towards Language-Universal Mandarin-English Speech Recognition

A unified system for multilingual speech recognition and language identification

Reliable accent specific unit generation with dynamic Gaussian mixture selection for multi-accent speech recognition

Adaptive Attention Network with Domain Adversarial Training for Multi-Accent Speech Recognition

Detection-based accented speech recognition using articulatory features

Distant-Talking Accent Recognition by Combining GMM and DNN

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

Mandarin-English bilingual phone modeling and combining MPE based Discriminative training for cross-language speech recognition