Abstract: Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Our previous work demonstrated the superior noise robustness of very deep convolutional neural networks (VDCNNs). Building on that work, this paper proposes a more advanced model, the very deep convolutional residual network (VDCRN). The new model incorporates batch normalization and residual learning and is more robust than previous VDCNNs. Then, to alleviate the mismatch between training and testing conditions, model adaptation and adaptive training are developed and compared for the new VDCRN. This paper focuses on factor aware training (FAT) and cluster adaptive training (CAT). For FAT, a unified framework is explored. For CAT, two schemes are first explored to construct the bases in the canonical model; furthermore, a factorized version of CAT is designed to address multiple nonspeech variabilities in one model. Finally, a complete multipass system is proposed to achieve the best system performance in noisy scenarios. The proposed approaches are evaluated on three tasks: Aurora4 (simulated data with additive noise and channel distortion), CHiME4 (both simulated and real data with additive noise and reverberation), and the AMI meeting transcription task (real data with significant reverberation). The evaluation thus covers different noisy conditions as well as both simulated and real noisy data. The experiments show that the new VDCRN is more robust, and that adaptation on this model further reduces the word error rate (WER) significantly. The best proposed architecture obtains consistent and very large improvements on all tasks compared to the baseline VDCNN or long short-term memory (LSTM) models. In particular, on Aurora4 a new milestone of 5.67% WER is achieved by improving acoustic modeling alone.

A Comparative Study of Robustness of Deep Learning Approaches for VAD

Deep Learning Approaches for Voice Activity Detection

Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition

Multi-task Joint-Learning for Robust Voice Activity Detection

The Robustness Study of Multiple Kernel Learning Approaches for VAD

Noise Robust Speech Recognition on Aurora4 by Humans and Machines

A Universal VAD Based on Jointly Trained Deep Neural Networks

An improved noise-robust voice activity detector based on hidden semi-Markov models

Robust Voice Activity Detection Using a Masked Auditory Encoder Based Convolutional Neural Network

AUC Optimization for Deep Learning Based Voice Activity Detection

Sparse Representation with Optimized Learned Dictionary for Robust Voice Activity Detection

DNN-based Voice Activity Detection for Speaker Recognition

A Robust and Lightweight Voice Activity Detection Algorithm for Speech Enhancement at Low Signal-to-noise Ratio

sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

Denoising Deep Neural Networks Based Voice Activity Detection

Voice activity detection based on speech enhancement method

Phase Aware Deep Neural Network For Noise Robust Voice Activity Detection

Multi-task Joint-Learning of Deep Neural Networks for Robust Speech Recognition

Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection

End-to-End Speaker-Dependent Voice Activity Detection

A Robust, Real-Time Voice Activity Detection Algorithm for Embedded Mobile Devices