Abstract:

Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can therefore be used to convert speech into text for various purposes. These systems are deployed in different environments, clean or noisy, and are used by people of all ages and types. This variability is one of the major difficulties in developing an ASR system. An ASR system therefore needs to be efficient while also being accurate and robust. Our main goal is to minimize the error rate during both the training and testing phases of an ASR system. The performance of an ASR system depends on the combination of feature extraction technique and back-end technique. In this paper, using a continuous speech recognition system, we present a performance comparison of different combinations of feature extraction techniques and back-end techniques.

Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural Networks (DNNs) with DNN-HMM architecture, namely Karel’s, Dan’s, and the hybrid DNN-SGMM architecture, are used at the back end of the implemented system. Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end of the proposed system. The Kaldi toolkit has been used to implement the proposed work. The system is trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for the English language.

Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan’s DNN implementation with SGMM performs best for back-end acoustic modeling. The proposed architecture, with PLP feature extraction at the front end and the hybrid of Dan’s DNN implementation with SGMM at the back end, outperforms the other combinations in a noisy environment.

Conclusion: Automatic speech recognition has numerous applications in our lives, such as home automation, personal assistants, and robotics. It is highly desirable to build an ASR system with good performance. The performance of automatic speech recognition is affected by various factors, including vocabulary size, whether the system is speaker dependent or independent, whether the speech is isolated, discontinuous, or continuous, and adverse conditions such as noise. The paper presented an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM and Dan’s DNN at the back end to build a noise-robust ASR system.

Discussion: The work presented in this paper compares the performance of continuous ASR systems developed using different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each front-end technique is tested in combination with each back-end technique. Finally, the results of the combinations thus formed are compared to find the best-performing combination in noisy and clean conditions.
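The comparison described in the abstract, where every front end is paired with every back end and scored by error rate, can be sketched roughly as follows. This is a minimal illustration and not the authors' Kaldi pipeline: the technique names are taken from the abstract, while the function names (`edit_distance`, `error_rate`, `all_combinations`) are hypothetical, and the error rate is assumed to be the usual edit-distance-based rate (substitutions, insertions, and deletions divided by the reference length).

```python
from itertools import product

# Names taken from the abstract; organising them as a full grid is an
# illustrative assumption about how the comparison is laid out.
FRONT_ENDS = ["MFCC", "PLP", "GFCC"]
BACK_ENDS = ["mono-phone", "tri-phone", "SGMM", "DNN", "hybrid DNN-SGMM"]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def all_combinations():
    """Every (front end, back end) pair to be trained and evaluated."""
    return list(product(FRONT_ENDS, BACK_ENDS))
```

In such a setup, each pair returned by `all_combinations()` would be trained and decoded separately, once on clean and once on noisy test data, and `error_rate` would be applied to the reference and hypothesis token sequences (words or phones) of each decoded utterance.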

Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

Joint Training Of Front-End And Back-End Deep Neural Networks For Robust Speech Recognition

Performance Analysis of various Front-end and Back End Amalgamations for Noise-robust DNN-based ASR

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

Time-Domain Speech Enhancement for Robust Automatic Speech Recognition

Progressive Multi-Target Network Based Speech Enhancement with Snr-Preselection for Robust Speaker Diarization

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training

Joint Noise and Mask Aware Training for DNN-based Speech Enhancement with SUB-band Features

A Time Domain Progressive Learning Approach with SNR Constriction for Single-Channel Speech Enhancement and Recognition

Cross-domain Single-channel Speech Enhancement Model with Bi-projection Fusion Module for Noise-robust ASR

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition.

An efficient joint training model for monaural noisy-reverberant speech recognition

How does end-to-end speech recognition training impact speech enhancement artifacts?

A Progressive Learning Approach to Adaptive Noise and Speech Estimation for Speech Enhancement and Noisy Speech Recognition.

Joint Training of DNNs by Incorporating an Explicit Dereverberation Structure for Distant Speech Recognition

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

Mixed-Bandwidth Cross-Channel Speech Recognition Via Joint Optimization of DNN-Based Bandwidth Expansion and Acoustic Modeling.