Abstract: Speech recognition system performance degrades in noisy environments. If the acoustic models are built using features of clean utterances, the features of a noisy test utterance are acoustically mismatched with the trained model. This gives poor likelihoods and poor recognition accuracy. Model adaptation and feature normalisation are two broad areas that address this problem. While the former often gives better performance, the latter involves estimating fewer parameters, which makes it more feasible for practical implementation. This research focuses on the efficacy of various subspace-, statistical- and stereo-based feature normalisation techniques. A subspace-projection-based method is investigated, both as a standalone technique and as an adjunct to existing techniques; it reconstructs noisy speech features from a precomputed set of clean-speech building blocks. The building blocks, which form a basis for the clean-speech subspace, are learned by applying non-negative matrix factorisation (NMF) to log-Mel filter bank coefficients. The work provides a detailed study of how the method can be incorporated into the extraction process of Mel-frequency cepstral coefficients. Experimental results show that the new features are robust to noise, and achieve better results when combined with the existing techniques. The work also proposes a modification to the training process of the SPLICE algorithm for noise-robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets where clean and noisy training utterances, but not stereo counterparts, are required. Finally, a computationally efficient, MLLR-based run-time noise adaptation method within the SPLICE framework is proposed.
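The subspace-projection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the NMF cost function, the update rule, the number of basis vectors or the filter bank size, so everything below is an assumption made for illustration: multiplicative updates for the Euclidean cost, 23 filter bank channels, 4 basis vectors, synthetic data in place of real speech, and features that are already non-negative (for example a floored or offset log-Mel representation). The function names `learn_clean_basis` and `project_onto_clean_subspace` are hypothetical.

```python
import numpy as np


def learn_clean_basis(V_clean, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative clean features V_clean (channels x frames)
    as W @ H with multiplicative updates for the Euclidean cost.
    The columns of W play the role of the clean-speech building blocks."""
    rng = np.random.default_rng(seed)
    n_channels, n_frames = V_clean.shape
    W = rng.random((n_channels, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V_clean) / (W.T @ W @ H + eps)
        W *= (V_clean @ H.T) / (W @ H @ H.T + eps)
    return W, H


def project_onto_clean_subspace(V_noisy, W, n_iter=200, eps=1e-9, seed=0):
    """Keep the clean basis W fixed, estimate non-negative activations
    for the noisy features, and return the reconstruction W @ H."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)
    return W @ H


# Synthetic stand-in for clean filter bank features (assumed sizes).
rng = np.random.default_rng(1)
n_channels, n_frames, rank = 23, 100, 4
V_clean = rng.random((n_channels, rank)) @ rng.random((rank, n_frames))

# Learn the building blocks once, offline, from clean data only.
W, H = learn_clean_basis(V_clean, rank)
rel_err = np.linalg.norm(V_clean - W @ H) / np.linalg.norm(V_clean)

# At test time, reconstruct noisy features from the fixed clean basis.
V_noisy = V_clean + 0.1 * rng.random((n_channels, n_frames))
V_hat = project_onto_clean_subspace(V_noisy, W)
```

In a feature extraction pipeline of the kind the abstract describes, a reconstruction such as `V_hat` would stand in for the filter bank outputs before the cepstral transform; where exactly the projection is applied in the thesis is not specified in the abstract.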

Feature Normalisation for Robust Speech Recognition
