Abstract: In our previous work, we proposed a feature compensation approach using high-order vector Taylor series (VTS) approximation for noisy speech recognition. In this paper, we report new progress on making it more powerful and practical in real applications. First, mixtures of densities are used to enhance the distortion models of both additive noise and convolutional distortion. New formulations are derived and presented for maximum likelihood (ML) estimation of the distortion model parameters and for minimum mean squared error (MMSE) estimation of clean speech. Second, we improve both the efficiency and the accuracy of feature compensation by applying the higher-order information of the VTS approximation only to the noisy speech mean parameters, and by applying a temporal smoothing operation to the posterior probabilities of the Gaussian mixture components in clean speech estimation. Finally, we design a procedure for irrelevant variability normalization (IVN) based joint training of a reference Gaussian mixture model (GMM) for feature compensation and hidden Markov models (HMMs) for acoustic modeling using VTS-based feature compensation. The effectiveness of the proposed approach is confirmed by experiments on the Aurora3 benchmark database for a real-world in-vehicle connected digits recognition task. Compared with the ETSI advanced front-end, our approach achieves a significant improvement in recognition accuracy across three “training-testing” conditions for four languages.

Multi-Environment Model Adaptation Based on Vector Taylor Series for Robust Speech Recognition

VTS-based Robust Speech Recognition

Combining Eigenvoice Speaker Modeling And VTS-Based Environment Compensation For Robust Speech Recognition

Learning Virtual HD Model for Bi-model Emotional Speaker Recognition

Using vector Taylor series with noise clustering for speech recognition in non-stationary noisy environments

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

A New Robust Telephone Speech Recognition Algorithm With The Multi-Model Structures

Autoregressive Model-Based Robust Speech Recognition in Additive Noise Environment

CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition

Application of VTS Approximation Based Feature Compensation Approach to Speech Recognition

Partially Adaptive Multichannel Joint Reduction of Ego-noise and Environmental Noise

An Improved VTS Feature Compensation Using Mixture Models of Distortion and IVN Training for Noisy Speech Recognition

Robust Speech Recognition Method Based on Discriminative Learning of Environmental Features

Modeling Speaker Variability Using Long Short-Term Memory Networks For Speech Recognition

Enhancing CTC-based speech recognition with diverse modeling units

Multi-Channel Feature Adaptation for Robust Speech Recognition

Robust Speech Recognition Method Based on Discriminative Environment Feature Extraction

An Efficient Robust ASR System Based On The Combination Of Speech Enhancement And HMM Adaptation

Speech Selection and Environmental Adaptation for Asynchronous Speech Recognition

A VTS-based Feature Compensation Approach to Noisy Speech Recognition Using Mixture Models of Distortion

Residual Noise Compensation For Robust Speech Recognition In Nonstationary Noise