Abstract: Although great progress has been made in automatic speech recognition (ASR), performance still degrades significantly in noisy environments. In this paper, a novel factor-aware training framework, named neural network-based multifactor aware joint training, is proposed to improve recognition accuracy for noise-robust speech recognition. The approach is a structured model that integrates several functional modules into a single deep computational model. We extract speaker, phone, and environment factor representations using deep neural networks (DNNs) and integrate them into the main ASR DNN to improve classification accuracy. In addition, the hidden activations of the main ASR DNN are used to improve factor extraction, which in turn helps the ASR DNN. All model parameters, including those of the ASR DNN and the factor extraction DNNs, are jointly optimized under a multitask learning framework. Unlike traditional factor-aware training techniques, our approach requires no explicit separate stages for factor extraction and adaptation. Moreover, the proposed neural network-based multifactor aware joint training can easily be combined with conventional factor-aware training that uses explicit factors, such as i-vector, noise energy, and T60 value, to obtain additional improvement. The proposed method is evaluated on two main noise-robust tasks: the AMI single distant microphone task, in which reverberation is the main concern, and the Aurora4 task, in which multiple noise types exist. Experiments on both tasks show that the proposed model significantly reduces word error rate (WER). The best configuration achieved more than 15% relative reduction in WER over the baselines on these two tasks.
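The joint structure described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the layer sizes, the class counts in `FACTORS`, the auxiliary weight `LOSS_WEIGHT`, the exact wiring (factor networks reading the lower hidden layer, their embeddings concatenated into the upper layer), and all function names are assumptions chosen for illustration. The sketch only shows how one forward pass can produce a senone loss plus weighted speaker, phone, and environment losses that would be minimized together.

```python
import math
import random

# Hypothetical class counts per factor and a hypothetical auxiliary loss weight.
FACTORS = {"speaker": 4, "phone": 5, "environment": 3}
LOSS_WEIGHT = 0.3


def init_layer(rng, n_out, n_in):
    """Return (weights, bias) for a small dense layer with random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, layer):
    """Compute W x + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def hidden(x, layer):
    """Dense layer followed by a tanh nonlinearity."""
    return [math.tanh(v) for v in affine(x, layer)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def build_model(rng, n_in=6, n_hid=8, n_factor=3, n_senones=7):
    """Create the main ASR layers and one small factor network per factor."""
    model = {"lower": init_layer(rng, n_hid, n_in), "factors": {}}
    for name, n_classes in FACTORS.items():
        model["factors"][name] = {
            "embed": init_layer(rng, n_factor, n_hid),
            "classify": init_layer(rng, n_classes, n_factor),
        }
    n_joint = n_hid + n_factor * len(FACTORS)
    model["upper"] = init_layer(rng, n_hid, n_joint)
    model["senone"] = init_layer(rng, n_senones, n_hid)
    return model


def joint_loss(model, x, senone_label, factor_labels):
    """One forward pass returning the multitask loss and its components."""
    h = hidden(x, model["lower"])  # lower hidden layer of the main ASR network
    joint_input = list(h)
    aux_losses = {}
    for name, net in model["factors"].items():
        f = hidden(h, net["embed"])  # factor embedding computed from ASR activations
        posterior = softmax(affine(f, net["classify"]))
        aux_losses[name] = -math.log(posterior[factor_labels[name]])
        joint_input.extend(f)  # feed the factor embedding back into the ASR network
    upper = hidden(joint_input, model["upper"])
    senone_post = softmax(affine(upper, model["senone"]))
    asr_loss = -math.log(senone_post[senone_label])
    total = asr_loss + LOSS_WEIGHT * sum(aux_losses.values())
    return total, asr_loss, aux_losses, senone_post


rng = random.Random(0)
model = build_model(rng)
frame = [rng.uniform(-1.0, 1.0) for _ in range(6)]  # one synthetic feature frame
labels = {"speaker": 1, "phone": 2, "environment": 0}
total, asr_loss, aux_losses, senone_post = joint_loss(model, frame, 3, labels)
print("joint loss:", round(total, 3))
```

In a real system the gradient of the joint loss would update every layer at once, which is what removes the separate factor extraction and adaptation stages; the sketch stops at the loss to stay short.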

Speech Recognition Based on Deep Tensor Neural Network and Multifactor Feature

Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition

Deep Factorization for Speech Signal

Robust Speech Recognition Combining Cepstral and Articulatory Features

Robust Multifactor Speech Feature Extraction Based on Gabor Analysis

Performance Optimization of Speech Recognition System with Deep Neural Network Model

Uyghur speech recognition based on deep neural network

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

Multi-scale Feature Based Convolutional Neural Networks for Large Vocabulary Speech Recognition

Learning Deep Multimodal Affective Features for Spontaneous Speech Emotion Recognition

Multimodal emotion recognition based on deep neural network

Research on Improving Phoneme Recognition Rate Based on Subspace Gaussian Mixture Model and Deep Neural Network Combination

Speech Emotion Recognition Based on Multi-task Deep Feature Extraction and MKPCA Feature Fusion

Emotion recognition using support vector machine and deep neural network

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition

A network model of speaker identification with new feature extraction methods and asymmetric BLSTM

Integrated Adaptation with Multi-Factor Joint-Learning for Far-Field Speech Recognition

A Chinese Speech Recognition System Based on Articulatory Features

Research on Chinese Speech Emotion Recognition Based on Deep Neural Network and Acoustic Features