X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Chang Sun, Bo Qin
2024-11-21
Abstract: Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. This approach addresses the cocktail party problem and is generally considered more promising for practical applications than conventional speech separation methods. Although academic research in this area has achieved high accuracy and evaluation scores on public datasets, most models exhibit significantly reduced performance in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which leverages CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, achieving state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally, to enhance the network's ability to capture and utilize auxiliary features of the target speaker, we integrate a cross-attention mechanism into the global multi-head self-attention (GMHSA) module within each CrossNet block. This facilitates more effective integration of target speaker features with mixed speech features. Experimental results show that our method achieves superior separation performance on the WSJ0-2mix and WHAMR! datasets, demonstrating strong robustness and stability.
Sound, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the performance degradation of **Target Speaker Extraction (TSE)** in noisy and reverberant environments. Although existing TSE models achieve high accuracy and evaluation scores on public datasets, their performance drops significantly in practical applications, especially under noisy or reverberant conditions. To address this limitation, the authors propose a new TSE model, X-CrossNet.

### 1. Research Background and Problem Description

In real-world application scenarios, the speech signals captured by microphones usually contain multiple speakers, noise, and reverberation, which pose great challenges to tasks such as speech recognition. To meet these challenges, researchers have proposed various speech enhancement techniques, one of which is **Target Speaker Extraction (TSE)**. TSE aims to separate the voice of a specific target speaker from the mixed speech by using auxiliary features related to the target speaker.

### 2. Limitations of Existing Methods

Although academic research has made significant progress in the field of TSE, most existing models perform poorly in noisy or reverberant real-world environments. For example:

- **Blind Source Separation (BSS)**: Although BSS performs well on benchmark datasets, its performance is difficult to guarantee in practical application scenarios because the number of speakers is unpredictable.
- **Traditional TSE models**: These models work well under ideal conditions, but their performance drops significantly in complex environments (such as noisy or reverberant environments).

### 3. Innovations of X-CrossNet

To solve the above problems, the authors propose X-CrossNet. Its main innovations include:

- **Using CrossNet as the backbone network**: CrossNet is a speech separation network optimized specifically for noisy and reverberant environments and has already performed well in similar tasks.
- **Introducing the cross-attention mechanism**: By integrating a cross-attention mechanism into the Global Multi-Head Self-Attention (GMHSA) module of each CrossNet block, the network's ability to capture and utilize the auxiliary features of the target speaker is enhanced.
- **Improved structural design**: X-CrossNet extends the functions of CrossNet with minimal parameter addition and without introducing new integration structures, thereby improving the robustness and stability of the TSE task in complex acoustic conditions.

### 4. Experimental Verification

The experimental results show that X-CrossNet outperforms existing methods on the WSJ0-2mix and WHAMR! datasets, especially in noisy and reverberant environments, demonstrating strong robustness and stability.

### Summary

The core objective of this paper is to solve the performance degradation problem of TSE in complex environments by proposing the X-CrossNet model, thereby improving the feasibility and effectiveness of TSE in practical applications.
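
To make the cross-attention fusion idea more concrete, here is a minimal NumPy sketch of generic single-head cross-attention, in which mixture frames act as queries and target-speaker features act as keys and values. This is not the authors' implementation: the summary does not specify how the mechanism is wired into the GMHSA module, so the function name `cross_attention_fuse`, the residual addition, the random projection weights, and all dimensions are assumptions chosen purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(mix, spk, w_q, w_k, w_v):
    """Fuse speaker features into mixture features (hypothetical helper).

    mix: (T, D) mixture features, one row per time frame.
    spk: (S, D) auxiliary target-speaker features.
    Returns the fused (T, D) features and the (T, S) attention weights.
    """
    q = mix @ w_q                              # queries from the mixture, (T, d)
    k = spk @ w_k                              # keys from the speaker cue, (S, d)
    v = spk @ w_v                              # values from the speaker cue, (S, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores, (T, S)
    weights = softmax(scores, axis=-1)         # each mixture frame attends over speaker frames
    fused = mix + weights @ v                  # residual fusion (an assumption, not from the paper)
    return fused, weights


# Toy example with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
T, S, D, d = 6, 3, 8, 4
mix = rng.standard_normal((T, D))
spk = rng.standard_normal((S, D))
w_q = rng.standard_normal((D, d)) * 0.1
w_k = rng.standard_normal((D, d)) * 0.1
w_v = rng.standard_normal((D, D)) * 0.1

fused, weights = cross_attention_fuse(mix, spk, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the fused output keeps the shape of the mixture features, so a step like this could in principle sit inside an existing attention block without changing the surrounding layers. That is one plausible reading of "without introducing new integration structures", not a confirmed detail of X-CrossNet.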