Abstract: Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due to its potential for various applications such as user-customized interfaces and hearing aids, or as a crucial front-end processing technology for subsequent tasks such as speech recognition and speaker recognition. However, there are currently few open-source toolkits or available pre-trained models for off-the-shelf usage. In this work, we introduce WeSep, a toolkit designed for research and practical applications in TSE. WeSep features flexible target speaker modeling, scalable data management, effective on-the-fly data simulation, structured recipes, and deployment support. The toolkit is publicly available at https://github.com/wenet-e2e/WeSep.

What problem does this paper attempt to address?

This paper addresses the challenges that Target Speaker Extraction (TSE) faces in practical applications. Specifically, TSE aims to separate the voice signal of a specific target speaker from multi-speaker mixed audio. This is the setting of the cocktail party problem: in a noisy environment, humans can focus on the voice of one particular speaker.

### Main problems and challenges

1. **Lack of open-source tools and pre-trained models**:
   - Few open-source tools and pre-trained models are currently available, which restricts the wide application and development of TSE technology.
2. **Insufficient generalization performance**:
   - Most TSE research uses synthetic datasets, which may not generalize well to the complex audio environments of the real world.
   - Improving generalization to unseen speakers requires more advanced speaker-modeling techniques.
3. **Requirement for large-scale data processing**:
   - In the era of large-scale models, it is crucial to make use of rich online media resources. However, before these resources can be used for tasks such as speech synthesis, they need to be processed and filtered. TSE can play an important role in these data-processing pipelines.

### Solutions

To address the above challenges, the authors propose WeSep, an open-source toolkit built specifically for the TSE task. The main features of WeSep include:

- **Flexible target-speaker modeling**: It supports multiple mainstream models and plans to integrate more powerful ones.
- **Scalable data management**: Through the Unified IO (UIO) framework, it can efficiently handle large-scale datasets.
- **On-the-fly data simulation**: It lets users mix single-speaker audio during training, without pre-mixing the data, which improves the robustness and performance of the model.
- **Structured recipes and deployment support**: It provides detailed experimental configurations and an easy-to-use model-export function.

Through these features, WeSep aims to provide a powerful and flexible platform for TSE research and practical applications.
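The idea of simulating mixtures from single-speaker audio at training time, rather than storing pre-mixed files, can be illustrated with a minimal sketch. This is not WeSep's actual code or API: the function names `power` and `mix_on_the_fly`, the random-crop strategy, and the SNR-based scaling are all assumptions chosen for illustration, and plain Python lists stand in for audio arrays.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_on_the_fly(target, interferer, snr_db, rng):
    """Hypothetical helper: build one training mixture from two utterances.

    Crops both signals to a common length at random offsets, scales the
    interferer so the target-to-interferer ratio equals snr_db, and
    returns (mixture, cropped_target).
    """
    n = min(len(target), len(interferer))
    # Random offsets so each call can yield a different mixture.
    t_off = rng.randint(0, len(target) - n)
    i_off = rng.randint(0, len(interferer) - n)
    t = target[t_off:t_off + n]
    i = interferer[i_off:i_off + n]

    eps = 1e-12  # guards against division by zero for silent input
    gain = math.sqrt((power(t) + eps) / ((power(i) + eps) * 10 ** (snr_db / 10)))
    mixture = [a + gain * b for a, b in zip(t, i)]
    return mixture, t


# Usage: two synthetic "utterances" (noise stands in for speech here).
demo_rng = random.Random(42)
speaker_a = [demo_rng.gauss(0.0, 1.0) for _ in range(2000)]
speaker_b = [demo_rng.gauss(0.0, 0.5) for _ in range(3000)]
demo_mix, demo_ref = mix_on_the_fly(speaker_a, speaker_b, snr_db=0.0, rng=demo_rng)
```

Because the offsets and the SNR can be redrawn on every call, the same pool of single-speaker recordings can yield a different mixture each epoch, which is the general motivation for on-the-fly simulation over a fixed pre-mixed dataset.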

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Multi-Level Speaker Representation for Target Speaker Extraction

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Probing Self-supervised Learning Models with Target Speech Extraction

Target conversation extraction: Source separation using turn-taking dynamics

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Improving Target Sound Extraction with Timestamp Information

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention