Abstract: Target speaker extraction (TSE), which directly extracts the desired speech given enrollment utterances of the target speaker, has attracted increasing attention for its potential to solve the cocktail-party problem. Despite the considerable progress made by existing time-domain methods, which have become the dominant approach for TSE, their performance often degrades significantly under more realistic conditions. This paper proposes a time-frequency (T-F) domain approach, X-TF-GridNet, which uses complex spectrum mapping to estimate the real and imaginary (RI) components of the target speech. Specifically, the TF-GridNet block serves as the primary speaker extractor module. The proposed method has two key extensions. First, a U²-Net style network extracts robust fixed speaker embeddings that efficiently capture and represent target speaker information. Second, an adaptive embedding fusion (AEF) mechanism ensures effective use of the target speaker information, making the backbone extractor focus on the speech of interest. Additionally, we introduce a multi-task learning framework comprising two distinct loss functions to explicitly improve both the discriminability of the speaker embeddings for the reference speech and the overall quality of the target speech. We conducted extensive ablation studies and quantitative comparisons against previous TSE methods on WSJ0-2mix and its noisy and reverberant counterparts. The proposed method achieved an SI-SDR of 19.7 dB with a moderate model size on the WSJ0-2mix dataset, and a larger model improved the SI-SDR to 20.7 dB. Experimental results demonstrated that, compared with existing time-domain approaches, the proposed method not only achieved competitive performance across multiple objective metrics but also mitigated speaker confusion errors under more challenging conditions, including interference such as noise and reverberation.
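For readers unfamiliar with the SI-SDR figures quoted above, the following is a minimal sketch of how scale-invariant signal-to-distortion ratio is commonly computed for a single pair of signals. It assumes the widely used zero-mean, projection-based definition; the function name `si_sdr` and the `eps` stabilizer are illustrative choices and are not taken from the paper, whose exact evaluation code is not described here.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: the optimally scaled target.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t)
    alpha = dot / (t_energy + eps)
    s_target = [alpha * b for b in t]

    # Everything in the estimate not explained by the scaled target.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))
```

Because the target is rescaled by the projection coefficient before the residual is measured, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.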
