Abstract: We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.

What problem does this paper attempt to address?

The paper addresses single-channel and multi-channel speaker separation and speech enhancement in reverberant and noisy conditions. It improves separation performance by introducing a new deep neural network architecture, CrossNet. The specific problems and solutions described in the paper are as follows.

### 1. **Problem Description**

- **Background Noise and Reverberation**: In practical scenarios, background noise and reverberation degrade the quality of speech signals, which makes speaker separation and speech enhancement difficult.
- **Long-Sequence Processing**: The performance of existing methods declines on long utterances, especially in the single-channel case.
- **Positional Encoding Problem**: Conventional positional encoding generalizes poorly to sequences longer than those seen during training.

### 2. **Solutions**

To meet these challenges, the paper proposes CrossNet, whose main components are:

- **Global Multi-Head Self-Attention Module (GMHSA)**: Captures global correlations in the time-frequency domain. Together with the cross-band and narrow-band modules, it lets the network handle complex time-frequency information.
- **Random Chunk Positional Encoding (RCPE)**: Addresses the weakness of conventional positional encoding on long sequences and improves the model's generalization to long utterances.
- **Cross-Band Module**: Captures correlations between different frequency bands, further enhancing the robustness of the model.
- **Narrow-Band Module**: Captures narrow-band correlations, that is, correlations within individual frequency bands, and strengthens the learning of local features.

### 3. **Experimental Verification**

The paper reports extensive experiments on multiple datasets (such as WSJ0-2mix, WHAMR!, and SMS-WSJ) to verify the effectiveness of CrossNet. The results show that CrossNet performs well across these conditions and also trains faster, at lower computational cost, than the compared methods.

### 4. **Innovations**

- **Improved Positional Encoding Method**: RCPE addresses the positional encoding problem in long-sequence processing.
- **Efficient Architecture Design**: By reducing the number of parameters and optimizing computational efficiency, CrossNet lowers the demand for computational resources while maintaining high performance.

### Summary

By proposing the CrossNet architecture, the paper aims to solve single-channel and multi-channel speaker separation in reverberant and noisy environments, and it verifies the effectiveness of the approach through a series of experiments.
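The random chunk positional encoding idea mentioned above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a standard sinusoidal table, and the names `sinusoidal_table` and `random_chunk_positions`, the table size, and the choice of offset 0 at inference are all hypothetical choices made for illustration. The point is only that training on randomly offset contiguous chunks of a long table exposes the model to position indices far beyond the length of any single training utterance.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a standard sinusoidal positional table (max_len rows, dim columns)."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(dim):
            # Even and odd columns share a frequency; even uses sin, odd uses cos.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def random_chunk_positions(table, num_frames, training, rng=random):
    """Select num_frames contiguous rows of the positional table.

    Training: the chunk starts at a random offset, so positions beyond the
    training utterance length are seen during training.
    Inference: the chunk starts at offset 0 (an assumption of this sketch).
    """
    max_len = len(table)
    if num_frames > max_len:
        raise ValueError("utterance is longer than the positional table")
    start = rng.randint(0, max_len - num_frames) if training else 0
    return start, table[start:start + num_frames]


# Hypothetical sizes: a 2000-frame table and a 250-frame training utterance.
table = sinusoidal_table(max_len=2000, dim=8)
train_start, train_chunk = random_chunk_positions(
    table, num_frames=250, training=True, rng=random.Random(0)
)
test_start, test_chunk = random_chunk_positions(table, num_frames=1500, training=False)
```

In this sketch a 1500-frame test utterance reuses table rows that training chunks have already covered, which is the intuition behind using such a scheme for long-utterance generalization.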

CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Cross-Talk Reduction

Glmsnet: single channel speech separation framework in noisy and reverberant environments

CompNet: Complementary Network for Single-Channel Speech Enhancement

Speaker and Direction Inferred Dual-channel Speech Separation

Blind Source Separation Based on Improved Wave-U-Net Network

A Multichannel Learning-Based Approach for Sound Source Separation in Reverberant Environments

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions

Atss-Net: Target Speaker Separation via Attention-based Neural Network

An End-to-End Speech Separation Method Based on Features of Two Domains

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet