Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection

Tianjiao Xu, Hao Li, Hui Zhang, Xueliang Zhang
DOI: https://doi.org/10.1109/APSIPAASC47483.2019.9023306
2019-01-01
Abstract: Voice activity detection (VAD) is an essential component of speech signal processing systems. The convolutional long short-term memory deep neural network (CLDNN), which consists of a CNN and an LSTM, has brought substantial improvements to VAD. However, because of the LSTM, the CLDNN must be trained on sequence data. To improve data utilization, we propose a two-stage training strategy. The first stage trains the CNN on its own, using shuffled frame-level data, to learn a high-level feature representation. The second stage trains the LSTM to model speech continuity. We show that our method has clear advantages in discriminative ability and generalization ability over the compared approaches across different scales of training data, especially on small datasets. The proposed method achieves a relative improvement of over 2.89% over the original CLDNN under the noise-matched condition and over 1.07% under the unmatched condition.
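The difference between the data consumed by the two stages can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy features, the batch size, and the fixed-length chunking used for the sequence stage are all assumptions made for illustration. It only shows that frame-level data can be pooled across utterances and shuffled freely, while sequence data must keep frames in temporal order.

```python
import random
from typing import List, Tuple

Frame = Tuple[List[float], int]   # (feature vector, speech / non-speech label)
Utterance = List[Frame]           # frames in temporal order


def stage1_frame_batches(utterances: List[Utterance], batch_size: int,
                         seed: int = 0) -> List[List[Frame]]:
    """Hypothetical stage-1 input: pool all frames, shuffle, then batch."""
    frames = [frame for utt in utterances for frame in utt]
    random.Random(seed).shuffle(frames)
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]


def stage2_sequence_chunks(utterances: List[Utterance],
                           seq_len: int) -> List[List[Frame]]:
    """Hypothetical stage-2 input: fixed-length chunks, temporal order kept.

    Trailing frames that do not fill a whole chunk are dropped here; that is
    a simplification of this sketch, not something stated in the abstract.
    """
    chunks = []
    for utt in utterances:
        for start in range(0, len(utt) - seq_len + 1, seq_len):
            chunks.append(utt[start:start + seq_len])
    return chunks


# Toy data (made up): two utterances of 5 and 3 frames, 2-dimensional features.
toy = [
    [([float(t), 0.0], t % 2) for t in range(5)],
    [([float(t), 1.0], 1) for t in range(3)],
]

frame_batches = stage1_frame_batches(toy, batch_size=4)  # for the CNN stage
seq_chunks = stage2_sequence_chunks(toy, seq_len=2)      # for the LSTM stage
```

In an actual training setup, the shuffled frame batches would feed the CNN in the first stage, and the ordered chunks would feed the LSTM in the second stage; how the CNN is combined with the LSTM in that second stage is not specified in the abstract and is left out of the sketch.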