Joint Training ResCNN-based Voice Activity Detection with Speech Enhancement.

Tianjiao Xu, Hui Zhang, Xueliang Zhang
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023101
2019-01-01
Abstract: Voice activity detection (VAD) is considered a solved problem in noise-free conditions, but it remains a challenging task in noisy conditions with a low signal-to-noise ratio (SNR). Intuitively, reducing noise should improve VAD. Therefore, in this study, we introduce a speech enhancement module to reduce noise. Specifically, a convolutional recurrent network (CRN) based encoder-decoder speech enhancement module is trained to reduce noise. The low-dimensional feature code from its encoder, together with the raw spectrum of the noisy speech, is then fed into a VAD module based on a deep residual convolutional neural network (ResCNN). The speech enhancement and VAD modules are connected and trained jointly. To balance the training speed of the two modules, an empirical dynamic gradient balance strategy is proposed. Experimental results show that the proposed joint-training method has clear advantages in generalization ability.
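The data flow and joint objective described in the abstract can be illustrated with a small sketch. The abstract does not specify the dynamic gradient balance strategy, so the gradient-norm matching rule below is only an assumed stand-in, not the authors' method. All function names (`vad_input`, `balance_weight`, `joint_loss`) and the toy values are hypothetical.

```python
import math


def vad_input(encoder_code, noisy_spectrum):
    """Concatenate, frame by frame, the enhancement encoder's code with the
    raw noisy spectrum to form the VAD module's input features."""
    return [code + spec for code, spec in zip(encoder_code, noisy_spectrum)]


def l2_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def balance_weight(se_grads, vad_grads, eps=1e-12):
    """ASSUMPTION: weight the VAD loss so that its gradient norm matches the
    speech enhancement gradient norm. The paper's actual rule may differ."""
    return l2_norm(se_grads) / (l2_norm(vad_grads) + eps)


def joint_loss(se_loss, vad_loss, vad_weight):
    """Joint objective: enhancement loss plus the re-weighted VAD loss."""
    return se_loss + vad_weight * vad_loss


# Toy example: one frame with a 2-dim encoder code and a 3-bin spectrum.
features = vad_input([[0.1, 0.2]], [[1.0, 2.0, 3.0]])
weight = balance_weight(se_grads=[3.0, 4.0], vad_grads=[0.6, 0.8])
total = joint_loss(se_loss=2.0, vad_loss=0.5, vad_weight=weight)
```

In a real system the weight would be recomputed during training from the gradients each loss produces on the shared encoder, so that neither module dominates the updates.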