Abstract: Target speaker extraction (TSE) has become an attractive research topic in recent years. However, TSE under underdetermined conditions remains a challenge. In this paper, we address a dual-channel TSE problem under underdetermined conditions. Geometric source separation (GSS) has been used as a solution to the TSE problem, but the performance of conventional GSS methods is limited under underdetermined conditions because they lack a powerful source model. We propose a dual-channel TSE method that combines target selection based on geometric constraints, more powerful source modeling, and nonlinear postprocessing. A geometric constraint (GC) on the target direction of arrival (DOA) is applied to select the target, and two conditional variational autoencoders (CVAEs) are used to model a single speaker's speech and the interference mixture speech, respectively. For postprocessing, an ideal ratio time-frequency (TF) mask estimated from the separated interference mixture speech is used to extract the target speaker's speech. Moreover, to overcome the impact of DOA estimation errors, we improve the objective function so that the target DOA information can be modified. The experimental results demonstrate that the proposed method achieves improvements of 6.24 dB and 8.37 dB over the baseline method in terms of signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR), respectively, under medium reverberation (470 ms). Furthermore, analysis of the experimental results shows that the improved method is robust against DOA estimation errors.
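The postprocessing step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the mask formula, so the sketch assumes that the target is approximated by the mixture minus the separated interference in the TF domain and that the mask is a power ratio. The function names, the `eps` regularizer, and the toy data are all hypothetical.

```python
import numpy as np


def ratio_mask_from_interference(mixture_tf, interference_tf, eps=1e-8):
    """Ratio-style TF mask for the target, with values in [0, 1].

    Assumption (not from the paper): the residual left after subtracting
    the separated interference from the mixture stands in for the target.
    """
    target_tf = mixture_tf - interference_tf
    target_power = np.abs(target_tf) ** 2
    interference_power = np.abs(interference_tf) ** 2
    # eps avoids division by zero in silent TF bins.
    return target_power / (target_power + interference_power + eps)


def extract_target(mixture_tf, interference_tf):
    """Apply the mask to the mixture spectrogram (frequency x time)."""
    mask = ratio_mask_from_interference(mixture_tf, interference_tf)
    return mask * mixture_tf


# Toy complex spectrograms standing in for STFT outputs (4 bins, 5 frames).
rng = np.random.default_rng(0)
shape = (4, 5)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interference = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interference

mask = ratio_mask_from_interference(mixture, interference)
estimate = extract_target(mixture, interference)
```

Because the mask is bounded between 0 and 1, it can only attenuate TF bins of the mixture. This is what makes the step nonlinear, in contrast to the linear spatial filtering performed by the separation stage.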

New Insights on Target Speaker Extraction

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Target Speech Extraction Based on Blind Source Separation and X-vector-based Speaker Selection Trained with Data Augmentation

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

Probing Self-supervised Learning Models with Target Speech Extraction

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Target Speech Extraction with Pre-trained Self-supervised Learning Models

Improving Source Separation via Multi-Speaker Representations

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

All Information is Necessary: Integrating Speech Positive and Negative Information by Contrastive Learning for Speech Enhancement

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction

Extracting the Auditory Attention in a Dual-Speaker Scenario From EEG Using a Joint CNN-LSTM Model

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation