Abstract: Target speaker extraction (TSE) has become an attractive research topic in recent years. However, TSE under underdetermined conditions remains a challenge. In this paper, we address a dual-channel TSE problem under underdetermined conditions. Geometric source separation (GSS) has been used as a solution to the TSE problem, but the performance of conventional GSS methods is limited under underdetermined conditions because they lack a powerful source model. We propose a dual-channel TSE method that combines target selection based on geometric constraints, more powerful source modeling, and nonlinear postprocessing. A geometric constraint (GC) on the target direction of arrival (DOA) is applied to select the target, and two conditional variational autoencoders (CVAEs) are used to model a single speaker's speech and the interference mixture speech, respectively. For postprocessing, an ideal ratio time-frequency (TF) mask estimated from the separated interference mixture speech is used to extract the target speaker's speech. Moreover, to overcome the impact of DOA estimation errors, we improve the objective function so that the target DOA information can be adjusted. The experimental results demonstrate that the proposed method achieves improvements of 6.24 dB and 8.37 dB over the baseline method in terms of signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR), respectively, under medium reverberation (470 ms). Furthermore, analysis of the experimental results shows that the improved method is robust against DOA estimation errors.

A Study of Multichannel Spatiotemporal Features and Knowledge Distillation on Robust Target Speaker Extraction

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Time Difference of Arrival Estimation Exploiting Multichannel Spatio-Temporal Prediction

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues

Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Target conversation extraction: Source separation using turn-taking dynamics

Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Improving Target Sound Extraction with Timestamp Information

Multi-Level Speaker Representation for Target Speaker Extraction

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network