Speaker Diarization with Enhancing Speech for the First DIHARD Challenge

Lei Sun, Jun Du, Chao Jiang, Xueyang Zhang, Shan He, Bing Yin, Chin-Hui Lee
DOI: https://doi.org/10.21437/interspeech.2018-1742
2018-01-01
Abstract: We design a novel speaker diarization system for the first DIHARD challenge by integrating several important modules: speech denoising, speech activity detection (SAD), i-vector design, and scoring strategy. One main contribution is the proposed long short-term memory (LSTM) based speech denoising model. By fully utilizing diversified simulated training data and an advanced network architecture that uses progressive multitask learning with a dense structure, the denoising model demonstrates strong generalization capability to realistic noisy environments. The enhanced speech can boost the performance of the subsequent SAD, segmentation, and clustering. To the best of our knowledge, this is the first time that deep learning based single-channel speech enhancement has been shown to yield significant improvements over state-of-the-art diarization systems in highly mismatched conditions. For the design of i-vector extraction, we adopt a residual convolutional neural network trained on a large dataset including more than 30,000 people. Finally, by score fusion of different i-vectors based on all these techniques, our systems yield diarization error rates (DERs) of 24.56% and 36.05% on the evaluation sets of Track1 and Track2, ranking second among 14 and 11 participating teams, respectively.