Speakerfilter: deep learning-based target speaker extraction using anchor speech

Shulin He, Hao Li, Xueliang Zhang
DOI: https://doi.org/10.1109/icassp40776.2020.9054222
2020-01-01
Abstract: Speaker extraction aims to separate a target speaker's voice from a mixture of multiple voices, which is useful in applications such as teleconferencing. In many practical cases, a short sample of the target speaker's voice can be obtained in advance, which provides useful information for speaker extraction. This paper addresses the problem of extracting the target speaker from a mixture using a short piece of anchor speech. To use the anchor speech effectively, we propose a multi-level feature extraction method and seamlessly integrate the features into a speech separation model. Experiments are conducted on the two-speaker dataset (WSJ0-mix2), which is widely used for speaker extraction. Systematic evaluation shows that the proposed method significantly outperforms previous methods and achieves a signal-to-distortion ratio (SDR) improvement of 11.3 dB over the unprocessed mixture.
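To make the reported metric concrete, the following is a minimal sketch of how an SDR improvement can be computed. It assumes a simplified definition of SDR (the energy of the clean target divided by the energy of the residual error, in dB), which may differ from the exact evaluation toolkit used in the paper. The function names `sdr_db` and `sdr_improvement_db` and the toy signals are hypothetical and serve only as illustration.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is a plain energy-ratio definition, not necessarily
    the metric implementation used by the authors.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the extracted signal minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy two-speaker example (synthetic sinusoids, not real speech).
target = [math.sin(0.1 * n) for n in range(1000)]
interferer = [0.8 * math.sin(0.37 * n + 1.0) for n in range(1000)]
mixture = [t + i for t, i in zip(target, interferer)]

# Hypothetical extractor output: the interferer is attenuated to 10% amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement = sdr_improvement_db(target, mixture, estimate)
print(f"SDR improvement: {improvement:.1f} dB")
```

In this toy case the residual interference is reduced tenfold in amplitude, so the improvement is 20 dB. Under the same definition, the 11.3 dB figure in the abstract would correspond to the gap between the SDR of the extracted speech and the SDR of the input mixture.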