Voice Activity Detection Based on Time-Delay Neural Networks

Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Bin Liu
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023262
2019-01-01
Abstract: Voice activity detection (VAD) is an important preprocessing step in many speech applications, and context information is important for VAD. Time-delay neural networks (TDNNs) capture long context information with few parameters. This paper investigates a TDNN-based VAD framework. A simple chunk-based decision method is proposed to smooth raw posteriors and decide the border points of utterances. To evaluate decision performance, the intersection-over-union (IoU) metric is introduced from image object detection. Experiments are conducted on the Wall Street Journal (WSJ0) corpus. Frame classification performance is measured by area under the curve (AUC) and equal error rate (EER). Compared with a long short-term memory (LSTM) baseline, the TDNN-based system achieves a 41.26% relative EER reduction on average in the matched noise condition, and a 3.82% relative improvement in average AUC. The proposed decision method achieves an 18.74% relative IoU improvement on average compared with the moving average method.
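The abstract borrows IoU from image object detection to score how well detected utterance boundaries match the reference. The following is a minimal sketch of one plausible one-dimensional form of that metric, assuming each utterance is a single (start, end) interval in seconds. The function name `interval_iou` and the interval representation are illustrative assumptions; the abstract does not give the paper's exact formulation.

```python
def interval_iou(pred, ref):
    """IoU of two time intervals given as (start, end) in seconds.

    Hypothetical helper: a 1-D analogue of bounding-box IoU,
    not necessarily the definition used in the paper.
    """
    p_start, p_end = pred
    r_start, r_end = ref
    # Overlap length is zero when the intervals do not intersect.
    intersection = max(0.0, min(p_end, r_end) - max(p_start, r_start))
    # Union length = sum of both lengths minus the shared part.
    union = (p_end - p_start) + (r_end - r_start) - intersection
    return intersection / union if union > 0 else 0.0


# Predicted utterance 1.0-4.0 s against a reference of 2.0-5.0 s:
# 2 s of overlap over a 4 s union.
print(interval_iou((1.0, 4.0), (2.0, 5.0)))  # 0.5
```

Under this definition, a detector that cuts an utterance short or pads it with silence is penalized in proportion to the mismatched duration, which frame-level metrics such as AUC and EER do not capture directly.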