Abstract: Although great progress has been made in automatic speech recognition (ASR), performance still degrades significantly in noisy environments. In this paper, a novel factor-aware training framework, named neural network-based multifactor aware joint training, is proposed to improve recognition accuracy for noise-robust speech recognition. The approach is a structured model that integrates several functional modules into a single deep computational model. We extract speaker, phone, and environment factor representations using deep neural networks (DNNs) and integrate them into the main ASR DNN to improve classification accuracy. In addition, the hidden activations of the main ASR DNN are used to improve factor extraction, which in turn helps the ASR DNN. All model parameters, including those of the ASR DNN and the factor extraction DNNs, are jointly optimized under a multitask learning framework. Unlike traditional factor-aware training techniques, our approach requires no explicit, separate stages for factor extraction and adaptation. Moreover, the proposed method can easily be combined with conventional factor-aware training that uses explicit factors, such as i-vectors, noise energy, and T60 values, to obtain additional improvement. The proposed method is evaluated on two noise-robust tasks: the AMI single distant microphone task, in which reverberation is the main concern, and the Aurora4 task, in which multiple noise types exist. Experiments on both tasks show that the proposed model significantly reduces word error rate (WER). The best configuration achieved a relative WER reduction of more than 15% over the baselines on these two tasks.
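
As a rough illustration of the structure described in the abstract, the following minimal sketch wires three factor extraction sub-networks into a main ASR network and sums the losses in a multitask fashion. It is not the authors' implementation: the layer sizes, the single hidden layer, the point at which factors are concatenated, the auxiliary loss weight `aux_weight`, and all function and parameter names are assumptions made for illustration. Only the forward pass and the joint loss are shown; in practice the parameters would be trained by backpropagating this loss with an automatic differentiation framework.

```python
import math
import random

random.seed(0)

def linear(n_in, n_out):
    """Random weight matrix with n_out rows and n_in columns (illustrative init)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def cross_entropy(probs, target):
    return -math.log(probs[target])

# Hypothetical sizes; none of these values come from the paper.
FEAT, HID, BOTTLENECK, SENONES = 8, 16, 4, 10
FACTOR_CLASSES = {"speaker": 5, "phone": 6, "environment": 3}

params = {
    "asr_in": linear(FEAT, HID),
    # The ASR output layer sees the hidden activations plus all factor vectors.
    "asr_out": linear(HID + BOTTLENECK * len(FACTOR_CLASSES), SENONES),
}
for name, n_cls in FACTOR_CLASSES.items():
    # Each factor extractor sees the input features and the ASR hidden activations.
    params[name + "_enc"] = linear(FEAT + HID, BOTTLENECK)
    params[name + "_cls"] = linear(BOTTLENECK, n_cls)

def forward(x):
    """Return senone posteriors and per-factor posteriors for one frame."""
    h = relu(matvec(params["asr_in"], x))
    factors, factor_probs = [], {}
    for name in FACTOR_CLASSES:
        f = relu(matvec(params[name + "_enc"], x + h))  # list concatenation
        factors += f
        factor_probs[name] = softmax(matvec(params[name + "_cls"], f))
    senone_probs = softmax(matvec(params["asr_out"], h + factors))
    return senone_probs, factor_probs

def multitask_loss(x, senone_target, factor_targets, aux_weight=0.3):
    """Main ASR cross-entropy plus weighted auxiliary factor losses."""
    senone_probs, factor_probs = forward(x)
    loss = cross_entropy(senone_probs, senone_target)
    for name, target in factor_targets.items():
        loss += aux_weight * cross_entropy(factor_probs[name], target)
    return loss

frame = [0.1 * i for i in range(FEAT)]
print(multitask_loss(frame, 2, {"speaker": 1, "phone": 4, "environment": 0}))
```

Because every sub-network contributes to a single scalar loss, one optimizer step would update the ASR and factor extraction parameters together, which is the sense in which no separate extraction stage is needed in this sketch.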

TLS-NAP Algorithm for Text-Independent Speaker Recognition

Auditory Sparse Representation for Robust Speaker Recognition Based on Tensor Structure

Nonnegative Tensor PCA and Application to Speaker Recognition in Noise Environments

Robust Speaker Modeling Based on Constrained Nonnegative Tensor Factorization

Discriminant Local Information Distance Preserving Projection for Text-Independent Speaker Recognition

Neural Networks for Improved Text-Independent Speaker Identification.

Orthogonal subspace combination based on the joint factor analysis for text-independent speaker recognition

Text-independent speaker recognition using LSTM-RNN and speech enhancement

Research on Embedded Text-Dependent Speaker Recognition Algorithms and Its Implementation

Non-negative Tensor Factorization for Speech Enhancement

Total Variability Factors Combination for Speaker Verification

EfficientTDNN: Efficient Architecture Search for Speaker Recognition

VTS-based Robust Speech Recognition

Modeling Speaker Variability Using Long Short-Term Memory Networks For Speech Recognition

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

Speaker recognition based on improved ECAPA-TDNN network

Speaker Recognition Based on Long-Term Acoustic Features with Analysis Sparse Representation

Speaker Recognition Based on Long Short-Term Memory Networks

Improved PRSVM Method for Language Recognition

Robust Speaker Identification In Noise Using Missing Data Technique And Auditory Masking

Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition.