CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Vahid Ahmadi Kalkhorani, DeLiang Wang
2024-03-06
Abstract: We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses single-channel and multi-channel speaker separation and speech enhancement in reverberant and noisy conditions. To improve separation performance, it introduces a new deep neural network architecture, CrossNet. The specific problems and solutions described in the paper are as follows.

### 1. **Problem Description**

- **Background Noise and Reverberation**: In practical scenarios, background noise and reverberation severely degrade speech quality, which makes speaker separation and speech enhancement difficult.
- **Long-Sequence Processing**: The performance of existing methods degrades on long utterances, especially in the single-channel case.
- **Positional Encoding Problem**: Conventional positional encoding methods generalize poorly to sequence lengths outside the training distribution.

### 2. **Solutions**

To meet these challenges, the paper proposes CrossNet, which has the following main components:

- **Global Multi-Head Self-Attention Module (GMHSA)**: Captures global correlations in the time-frequency domain.
- **Random Chunk Positional Encoding (RCPE)**: Addresses the weakness of conventional positional encoding on long sequences and improves the model's generalization to long utterances.
- **Cross-Band Module**: Captures correlations across frequency bands, further improving the robustness of the model.
- **Narrow-Band Module**: Captures narrow-band correlations, that is, correlations within individual frequency bands, and strengthens the learning of local features.

### 3. **Experimental Verification**

The paper reports extensive experiments on multiple datasets (such as WSJ0-2mix, WHAMR!, and SMS-WSJ) to verify the effectiveness of CrossNet. The results show that CrossNet performs well across conditions and also trains faster, at lower computational cost, than the compared methods.

### 4. **Innovations**

- **Improved Positional Encoding Method**: RCPE addresses the positional encoding problem in long-sequence processing.
- **Efficient Architecture Design**: By reducing the number of parameters and optimizing computational efficiency, CrossNet lowers the demand for computational resources while maintaining high performance.

### Summary

By proposing the CrossNet architecture, the paper aims to solve single-channel and multi-channel speaker separation in reverberant and noisy environments, and it verifies the architecture's effectiveness through a series of experiments.
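
The random chunk positional encoding idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a sinusoidal embedding table precomputed for a maximum length longer than any training utterance, a contiguous chunk drawn at a random offset during training, and a chunk starting at offset 0 at inference. The function names `sinusoidal_table` and `random_chunk_pe`, the choice of a sinusoidal table, and the inference-time offset are all assumptions made for illustration.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a (max_len x dim) sinusoidal positional table (assumed PE type)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def random_chunk_pe(table, num_frames, training, rng=None):
    """Return (start, chunk): a contiguous slice of num_frames rows.

    Training: the start offset is drawn uniformly, so the model sees
    positions from the whole table even when utterances are short.
    Inference: the slice starts at offset 0 (an assumption of this sketch).
    """
    max_len = len(table)
    if num_frames > max_len:
        raise ValueError("utterance is longer than the positional table")
    if training:
        if rng is None:
            rng = random.Random()
        # randint is inclusive on both ends
        start = rng.randint(0, max_len - num_frames)
    else:
        start = 0
    return start, table[start:start + num_frames]


# Usage: a short training utterance still exercises late positions.
table = sinusoidal_table(max_len=1000, dim=8)
start, pe = random_chunk_pe(table, num_frames=250, training=True,
                            rng=random.Random(0))
```

The design intuition, under these assumptions, is that sampling the offset exposes the model during training to position indices it would otherwise meet only in long test utterances.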