Abstract: Target speaker extraction (TSE) has become an attractive research topic in recent years. However, TSE under underdetermined conditions remains a challenge. In this paper, we address a dual-channel TSE problem under underdetermined conditions. Geometric source separation (GSS) has been used to solve the TSE problem, but the performance of conventional GSS methods is limited under underdetermined conditions because they lack a powerful source model. We propose a dual-channel TSE method that combines target selection based on geometric constraints, more powerful source modeling, and nonlinear postprocessing. A geometric constraint (GC) on the target direction of arrival (DOA) is applied to select the target, and two conditional variational autoencoders (CVAEs) are used to model a single speaker's speech and the interference mixture speech. For postprocessing, an ideal ratio time-frequency (TF) mask estimated from the separated interference mixture speech is used to extract the target speaker's speech. Moreover, to overcome the impact of DOA estimation errors, we improve the objective function so that the target DOA information can be modified. The experimental results demonstrate that the proposed method achieves improvements of 6.24 dB in signal-to-distortion ratio (SDR) and 8.37 dB in source-to-interference ratio (SIR) over the baseline method under medium reverberation of 470 ms. Furthermore, analysis of the experimental results shows that the improved method is robust against DOA estimation errors.
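The postprocessing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the mask formula, so this assumes a power-domain ratio mask computed from the mixture spectrogram and the estimated interference spectrogram; the function names, the `eps` parameter, and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def ratio_mask_from_interference(mix_spec, interf_spec, eps=1e-8):
    """Hypothetical helper: power-domain ratio mask for the target.

    Assumption: the mask is one minus the share of each TF bin's power
    attributed to the interference estimate. The paper may define it differently.
    """
    mix_pow = np.abs(mix_spec) ** 2
    interf_pow = np.abs(interf_spec) ** 2
    mask = 1.0 - interf_pow / (mix_pow + eps)
    # Keep the mask in [0, 1] in case the interference estimate exceeds the mixture.
    return np.clip(mask, 0.0, 1.0)

def extract_target(mix_spec, interf_spec):
    """Apply the mask to the mixture spectrogram, reusing the mixture phase."""
    return ratio_mask_from_interference(mix_spec, interf_spec) * mix_spec

# Toy 2x2 complex "spectrogram" (frequency x time); values are illustrative only.
mix = np.array([[2 + 0j, 0 + 1j], [1 + 1j, 3 + 0j]])
interf = np.array([[1 + 0j, 0 + 1j], [0 + 0j, 0 + 0j]])
mask = ratio_mask_from_interference(mix, interf)
target = extract_target(mix, interf)
```

Bins where the interference estimate accounts for all of the mixture power are suppressed, while bins with no estimated interference pass through unchanged. In practice the spectrograms would come from a short-time Fourier transform of the reference channel and of the separated interference signal.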

Generation-Based Target Speech Extraction with Speech Discretization and Vocoder

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

DENSE: Dynamic Embedding Causal Target Speech Extraction

End-to-end Code-switched TTS with Mix of Monolingual Recordings

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Multi-Level Speaker Representation for Target Speaker Extraction

Target Sound Extraction with Variable Cross-modality Clues

Target Speaker Extraction with Curriculum Learning

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

Probing Self-supervised Learning Models with Target Speech Extraction

DDTSE: Discriminative Diffusion Model for Target Speech Extraction

Improving curriculum learning for target speaker extraction with synthetic speakers

Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information