Abstract: Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. This approach addresses the cocktail party problem and is generally considered more promising for practical applications than conventional speech separation methods. Although academic research in this area has achieved high accuracy and evaluation scores on public datasets, most models exhibit significantly reduced performance in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which leverages CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, achieving state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally, to enhance the network's ability to capture and utilize auxiliary features of the target speaker, we integrate a Cross-Attention mechanism into the global multi-head self-attention (GMHSA) module within each CrossNet block. This facilitates more effective integration of target speaker features with mixed speech features. Experimental results show that our method achieves superior separation performance on the WSJ0-2mix and WHAMR! datasets, demonstrating strong robustness and stability.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the performance degradation of **Target Speaker Extraction (TSE)** in noisy and reverberant environments. Although existing TSE models have achieved high accuracy and evaluation scores on public datasets, their performance drops significantly in practical applications, especially in noisy or reverberant conditions. To address this limitation, the authors propose a new TSE model, X-CrossNet.

### 1. Research Background and Problem Description

In real-world application scenarios, the speech signals captured by microphones usually contain multiple speakers, noise, and reverberation, which pose great challenges to tasks such as speech recognition. To meet these challenges, researchers have proposed various speech enhancement techniques, one of which is **Target Speaker Extraction (TSE)**. TSE aims to separate the voice of a specific target speaker from the mixed speech by using auxiliary features related to the target speaker.

### 2. Limitations of Existing Methods

Although academic research has made significant progress in the field of TSE, most existing models perform poorly in noisy or reverberant real-world environments. For example:

- **Blind Source Separation (BSS)**: Although BSS performs well on benchmark datasets, its performance is difficult to guarantee in practical application scenarios because the number of speakers is unpredictable.
- **Traditional TSE models**: These models work well under ideal conditions, but their performance drops significantly in complex environments (such as noisy or reverberant environments).

### 3. Innovations of X-CrossNet

To solve the above problems, the authors propose X-CrossNet. Its main innovations include:

- **Using CrossNet as the backbone network**: CrossNet is a speech separation network optimized specifically for noisy and reverberant environments and has already performed well in similar tasks.
- **Introducing the cross-attention mechanism**: By integrating the cross-attention mechanism into the Global Multi-Head Self-Attention (GMHSA) module of each CrossNet block, the network's ability to capture and utilize the auxiliary features of the target speaker is enhanced.
- **Improved structural design**: X-CrossNet extends the functions of CrossNet with minimal parameter addition and without introducing new integration structures, thereby improving the robustness and stability of the TSE task in complex acoustic conditions.

### 4. Experimental Verification

The experimental results show that X-CrossNet outperforms existing methods on the WSJ0-2mix and WHAMR! datasets, especially in noisy and reverberant environments, demonstrating strong robustness and stability.

### Summary

The core objective of this paper is to solve the performance degradation problem of TSE in complex environments by proposing the X-CrossNet model, thereby improving the feasibility and effectiveness of TSE in practical applications.
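
To make the cross-attention fusion idea more concrete, here is a minimal sketch of generic single-head scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the summary does not specify the layer layout, so the choice of mixture frames as queries and target-speaker tokens as keys and values, the residual connection, the toy sizes, and all names (`cross_attention`, `w_q`, `w_k`, `w_v`, `random_matrix`) are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(mixture, speaker, w_q, w_k, w_v):
    """Fuse speaker tokens into mixture frames (hypothetical layout).

    mixture: T x D frame features, used as queries (assumption).
    speaker: S x D target-speaker tokens, used as keys and values (assumption).
    Returns the fused T x D features and the T x S attention weights.
    """
    q = matmul(mixture, w_q)
    k = matmul(speaker, w_k)
    v = matmul(speaker, w_v)
    scale = math.sqrt(len(q[0]))
    # Scaled dot-product scores between every frame and every speaker token.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, v)
    # Residual connection keeps the original mixture information.
    fused = [[m + a for m, a in zip(m_row, a_row)]
             for m_row, a_row in zip(mixture, attended)]
    return fused, weights


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random values (toy initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, S, D = 6, 3, 4  # frames, speaker tokens, feature size (toy values)
mixture = random_matrix(T, D, rng)
speaker = random_matrix(S, D, rng)
w_q, w_k, w_v = (random_matrix(D, D, rng) for _ in range(3))
fused, weights = cross_attention(mixture, speaker, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # shape of the fused features
```

Each row of `weights` sums to 1, so every mixture frame receives a convex combination of the projected speaker tokens. A real model would use learned multi-head projections and operate on time-frequency features rather than random toy matrices.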

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications

CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Multi-Level Speaker Representation for Target Speaker Extraction

AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Binaural Selective Attention Model for Target Speaker Extraction

Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

Self-attention Based Speaker Recognition Using Cluster-Range Loss

MSFNet: Multi-Scale Fusion Network for Brain-Controlled Speaker Extraction

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation