Generation-Based Target Speech Extraction with Speech Discretization and Vocoder.

Linfeng Yu, Wangyou Zhang, Chenpeng Du, Leying Zhang, Zheng Liang, Yanmin Qian
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446418
2024-01-01
Abstract: Target speech extraction (TSE) aims to isolate the speech of a specific target speaker from an audio mixture, with the help of an auxiliary recording of that speaker. Most existing TSE methods employ discrimination-based models to estimate the target speaker’s proportion in the mixture, but they often fail to compensate for missing or highly corrupted frequency components in the speech signal. In contrast, generation-based methods can naturally handle such scenarios via speech resynthesis. In this paper, we propose a novel discrete-token-based TSE approach that combines state-of-the-art speech discretization and vocoder techniques. By predicting a sequence of discrete tokens with the help of the auxiliary audio and employing a vocoder that takes discrete tokens as input, the target speech can be effectively resynthesized while eliminating interference. Our experiments on the WSJ0-2mix and Libri2mix datasets demonstrate that the proposed method yields high-quality target speech without interference.
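The token-then-vocoder interface described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the codebook, the nearest-centroid `discretize` function, and the lookup-based `toy_vocoder` are hypothetical stand-ins chosen for illustration, and the token predictor conditioned on the auxiliary audio is not modeled at all. The sketch only shows why resynthesis from discrete tokens can produce output drawn solely from a fixed codebook.

```python
from typing import List, Sequence

Frame = Sequence[float]


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discretize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry.

    Assumption: nearest-centroid quantization stands in for the
    speech discretization step; the paper's actual tokenizer may differ.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in frames
    ]


def toy_vocoder(tokens: Sequence[int], codebook: Sequence[Frame]) -> List[List[float]]:
    """Hypothetical stand-in for a vocoder: look up one frame per token.

    A real vocoder would generate a waveform from the token sequence.
    """
    return [list(codebook[t]) for t in tokens]


# Toy codebook with three 2-dimensional entries (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Noisy frames, standing in for features that a predictor would map to tokens.
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

tokens = discretize(frames, codebook)
resynth = toy_vocoder(tokens, codebook)

print(tokens)
print(resynth)
```

Because the stand-in vocoder sees only token indices, the small perturbations in `frames` do not survive into `resynth`; every output frame is an exact codebook entry.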