A Comparative Study of Robustness of Deep Learning Approaches for VAD

Sibo Tong, Hao Gu, Kai Yu
DOI: https://doi.org/10.1109/icassp.2016.7472768
2016-01-01
Abstract: Voice activity detection (VAD) is an important step in real-world automatic speech recognition (ASR) systems. Deep learning approaches such as DNNs, RNNs, and CNNs have been widely used in model-based VAD. Although they have achieved success in practice, they have been developed separately on different VAD tasks. While VAD performance under noisy conditions, especially with unseen noise or very low SNR, is of great interest, there has so far been no comparison of the robustness of different deep learning approaches. In this paper, to study their robustness, VAD models based on DNN, LSTM, and CNN are thoroughly compared at both the frame and segment levels under various noisy conditions on Aurora 4, a commonly used speech corpus with a rich variety of noise types. To improve the robustness of deep learning based VAD models, a new noise-aware training (NAT) approach is also proposed. Experiments show that LSTM-based VAD is the most robust, but its performance degrades dramatically under conditions with unseen noise or diverse SNRs. Incorporating NAT yields significant performance gains in these conditions.
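To make the "very low SNR" conditions concrete, the following is a minimal sketch of how a clean utterance can be mixed with noise at a chosen signal-to-noise ratio. It is an illustration only, not the authors' data preparation and not how Aurora 4 was built. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and the signals are synthetic random arrays standing in for real audio.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or truncate the noise so it matches the utterance length.
    noise = np.resize(noise, speech.shape)

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


def measured_snr_db(speech, mixture):
    """Recover the SNR of `mixture`, given the clean `speech` it was built from."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for one second of speech
    babble = rng.standard_normal(4000)   # stand-in for a shorter noise clip
    noisy = mix_at_snr(clean, babble, snr_db=0.0)
    print(round(measured_snr_db(clean, noisy), 3))
```

Lower `snr_db` values give the noise a larger share of the mixture's energy, which is the regime in which the abstract reports the largest degradation.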