Abstract: Speech recognition performance degrades in noisy environments. If the acoustic models are built using features of clean utterances, the features of a noisy test utterance are acoustically mismatched with the trained models, which yields poor likelihoods and poor recognition accuracy. Model adaptation and feature normalisation are two broad approaches to this problem. While the former often gives better performance, the latter requires estimating fewer parameters, which makes it more feasible for practical implementations. This research examines the efficacy of various subspace, statistical and stereo-based feature normalisation techniques. A subspace-projection-based method is investigated, both as a standalone and as an adjunct technique, in which noisy speech features are reconstructed from a precomputed set of clean speech building blocks. The building blocks are learned by applying non-negative matrix factorisation (NMF) to log-Mel filter bank coefficients, and they form a basis for the clean speech subspace. The work provides a detailed study of how the method can be incorporated into the extraction of Mel-frequency cepstral coefficients. Experimental results show that the new features are robust to noise and achieve better results when combined with existing techniques. The work also proposes a modification to the training process of the SPLICE algorithm for noise-robust speech recognition. The modification is based on feature correlations and enables this stereo-based algorithm to improve performance in all noise conditions, especially unseen ones. Further, the modified framework is extended to non-stereo datasets, which require clean and noisy training utterances but not their stereo counterparts. Finally, a computationally efficient, MLLR-based run-time noise adaptation method is proposed within the SPLICE framework.
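The subspace projection idea described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes the filter bank features have been made non-negative (for example by using log(1 + E) rather than log E), and it uses the standard multiplicative updates for the squared-error NMF objective, which the abstract does not specify. The function names `learn_basis` and `reconstruct` are hypothetical.

```python
import numpy as np

def learn_basis(V_clean, rank, n_iter=200, eps=1e-9, seed=0):
    """Learn a non-negative basis W (n_bands x rank) from clean features.

    V_clean is a non-negative matrix of shape (n_bands, n_frames).
    Uses multiplicative updates for the squared-error NMF objective
    (an assumption; the source does not state the cost function).
    """
    rng = np.random.default_rng(seed)
    n_bands, n_frames = V_clean.shape
    W = rng.random((n_bands, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V_clean) / (W.T @ W @ H + eps)
        W *= (V_clean @ H.T) / (W @ H @ H.T + eps)
    return W

def reconstruct(W, V_noisy, n_iter=200, eps=1e-9, seed=0):
    """Project noisy features onto the clean subspace spanned by W.

    W is held fixed; only the activations H are estimated for the
    noisy frames, and the reconstruction W @ H is returned.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)
    return W @ H
```

In an MFCC pipeline, the reconstructed matrix would take the place of the log-Mel filter bank output before the discrete cosine transform. The rank and the iteration count are tuning choices that this sketch leaves open.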
