A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge.

Lei Sun, Jun Du, Tian Gao, Yi Fang, Feng Ma, Jia Pan, Chin-Hui Lee
DOI: https://doi.org/10.1109/icassp.2019.8683243
2019-01-01
Abstract: In this paper, we design a two-stage single-channel speaker-dependent speech separation approach for the CHiME-5 Challenge, targeting far-field, multi-talker conversational speech recognition in dinner party scenarios with background noise, reverberation and overlapping speech. First, we conduct a detailed analysis of the CHiME-5 data and observe two problems: inaccurate human annotations and a scarcity of usable data for the target speakers. Motivated by this, we perform a first-stage speaker-dependent speech separation with a learning target designed for aggressive segregation, in order to generate more and purer target speech data. A second-stage speaker-dependent speech separation with a new learning target is then performed to obtain the final speech masks, which can be fed directly to the back-end acoustic model. Compared with the official baseline, our proposed approach yields an absolute word error rate reduction of 5.3%, from 81.3% to 76.0%, on the development test set. To the best of our knowledge, this is the first work to discuss a feasible method of single-channel speaker-dependent speech separation for such a challenging task, although we assume oracle speaker diarization, following the challenge rules. By integrating this crucial technique, our submitted systems achieved first place in all four tasks of the CHiME-5 Challenge.
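The reported gain is an absolute reduction in word error rate (WER), which is easy to confuse with a relative one. As a minimal sketch, the arithmetic behind the two figures can be checked as follows; the function name `wer_reduction` is a hypothetical helper introduced here for illustration, not something from the paper, and only the 81.3% and 76.0% values come from the abstract.

```python
from typing import Tuple


def wer_reduction(baseline_wer: float, system_wer: float) -> Tuple[float, float]:
    """Return (absolute, relative) WER reduction, both expressed in percent.

    Hypothetical helper for illustration only; not part of the paper.
    """
    # Absolute reduction: difference in percentage points.
    absolute = baseline_wer - system_wer
    # Relative reduction: the same difference as a share of the baseline WER.
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# WER values taken from the abstract (development test set).
absolute, relative = wer_reduction(81.3, 76.0)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 5.3 points, relative: 6.5%"
```

Under these numbers, the 5.3-point absolute reduction corresponds to a relative reduction of roughly 6.5% with respect to the baseline.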