WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li
2024-09-24
Abstract: Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention for its potential in applications such as user-customized interfaces and hearing aids, and as a crucial front-end processing technology for downstream tasks such as speech recognition and speaker recognition. However, there are currently few open-source toolkits or pre-trained models available for off-the-shelf use. In this work, we introduce WeSep, a toolkit designed for research and practical applications in TSE. WeSep features flexible target speaker modeling, scalable data management, effective on-the-fly data simulation, structured recipes, and deployment support. The toolkit is publicly available at https://github.com/wenet-e2e/WeSep.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses the challenges that Target Speaker Extraction (TSE) faces in practical applications. TSE aims to separate the voice of a specific target speaker from mixed multi-speaker audio. The task relates to the cocktail party problem: in a noisy environment, humans can focus on the voice of one particular speaker.

### Main problems and challenges

1. **Lack of open-source tools and pre-trained models**:
   - Few open-source tools and pre-trained models are currently available, which restricts the wider application and development of TSE technology.
2. **Insufficient generalization performance**:
   - Most TSE research uses synthetic datasets, which may not generalize well to complex real-world audio environments.
   - Improving generalization to unseen speakers requires more advanced speaker-modeling techniques.
3. **Requirement for large-scale data processing**:
   - In the era of large-scale models, it is crucial to make use of abundant online media resources. Before these resources can be used for tasks such as speech synthesis, they need to be processed and filtered, and TSE can play an important role in such data-processing pipelines.

### Solutions

To address these challenges, the authors propose WeSep, an open-source toolkit built specifically for the TSE task. Its main features include:

- **Flexible target-speaker modeling**: It supports multiple mainstream models, with plans to integrate more powerful ones.
- **Scalable data-management mechanism**: Through the Unified IO (UIO) framework, it can handle large-scale datasets efficiently.
- **On-the-fly data simulation**: Users can mix single-speaker audio on the fly during training instead of pre-mixing data, which improves the robustness and performance of the model.
- **Structured experimental configuration and deployment support**: It provides detailed experimental configurations and an easy-to-deploy model-export function.

Through these features, WeSep aims to provide a powerful and flexible platform for TSE research and practical applications.
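
The data simulation feature listed above, which builds mixtures from single-speaker audio during training rather than from pre-mixed files, can be illustrated with a minimal sketch. This is not WeSep's implementation or API. The function names (`rms`, `mix_at_snr`, `simulate_example`), the speaker-to-utterances dictionary layout, and the SNR range are all assumptions made for illustration, and plain Python lists stand in for the array types a real pipeline would use.

```python
import math
import random


def rms(signal, eps=1e-8):
    """Root-mean-square level of a list of float samples (floored at eps)."""
    return max(math.sqrt(sum(s * s for s in signal) / len(signal)), eps)


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer ratio equals `snr_db`, then add.

    Returns (mixture, target, scaled_interferer), all truncated to the shorter length.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    # Gain that places the interferer snr_db below (or above) the target level.
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    scaled = [gain * s for s in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, target, scaled


def simulate_example(utterances, rng, snr_range=(-5.0, 5.0)):
    """Draw one training example (mixture, enrollment, target) on the fly.

    `utterances` is a hypothetical layout: speaker id -> list of waveforms,
    with at least two utterances per speaker so that the enrollment clip
    differs from the clip placed in the mixture.
    """
    target_spk, other_spk = rng.sample(sorted(utterances), 2)
    target_wav, enroll_wav = rng.sample(utterances[target_spk], 2)
    interferer_wav = rng.choice(utterances[other_spk])
    snr_db = rng.uniform(*snr_range)
    mixture, target, _ = mix_at_snr(target_wav, interferer_wav, snr_db)
    return mixture, enroll_wav, target
```

Because each call draws a new speaker pair, utterance pair, and SNR, a model trained this way sees a different mixture every time instead of a fixed pre-mixed set, which is the motivation the summary gives for this feature.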