Abstract: This work proposes a neural network, named SpatialNet, that extensively exploits spatial information for multichannel joint speech separation, denoising and dereverberation. The proposed network performs end-to-end speech enhancement in the short-time Fourier transform (STFT) domain. It is mainly composed of interleaved narrow-band and cross-band blocks, which exploit narrow-band and cross-band spatial information, respectively. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to perform spatial-feature-based speaker clustering and temporal smoothing/filtering, respectively. The cross-band blocks process frames independently, and use a full-band linear layer and frequency convolutional layers to learn the correlation between all frequencies and between adjacent frequencies, respectively. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network is indeed performing speaker clustering (demonstrated by attention maps).

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to design a neural network that jointly performs multichannel speech separation, denoising, and dereverberation. Specifically, it proposes a network named SpatialNet, which extensively exploits spatial information to achieve these tasks. SpatialNet performs end-to-end speech enhancement in the short-time Fourier transform (STFT) domain.

### Main Contributions

1. **New Cross-Band Blocks**: Compared to previous narrow-band networks, the paper introduces cross-band blocks to better model spatial information across frequency bands.
2. **Extended Network Capabilities**: The network performs not only speech separation but also simultaneous denoising and dereverberation, and it is evaluated in multiple experiments.
3. **Superior Performance**: Experimental results show that SpatialNet achieves state-of-the-art performance on almost all tasks, both individual and joint.

### Background and Motivation

Microphone array signal processing is widely used in applications such as hearing aids, robots, and smart home devices. Real recordings, however, are inevitably affected by environmental noise, room reverberation, and interfering speech. Suppressing these interferences improves speech quality and the accuracy of automatic speech recognition (ASR). Traditional speech enhancement methods mainly rely on spatial information while ignoring the spectral content of the signal. This paper designs a neural network to fully utilize the spatial information in multichannel recordings, thereby achieving better speech enhancement results.

### Method Overview

SpatialNet consists of narrow-band blocks and cross-band blocks:

1. **Narrow-Band Blocks**:
   - Process each frequency independently, with all frequencies sharing the same network parameters.
   - Use Multi-Head Self-Attention (MHSA) for speaker clustering based on spatial features, and temporal convolution layers for temporal smoothing/filtering.
   - The main components of the narrow-band blocks are:
     - **Multi-Head Self-Attention Module**: Groups components of the same speaker and separates components of different speakers by computing the similarity of spatial vectors at the same STFT frequency.
     - **Temporal Convolutional Feed-Forward Network (T-ConvFFN)**: Inserts temporal convolution layers between two linear layers to perform local smoothing/averaging and to model reverberation.
2. **Cross-Band Blocks**:
   - Process each time frame independently, with all time frames sharing the same network parameters.
   - Use frequency convolution layers to learn the correlation between adjacent frequencies, and full-band linear modules to learn cross-band spatial information.
   - The main components of the cross-band blocks are:
     - **Frequency Convolution Module**: Models the correlation between adjacent frequencies.
     - **Full-Band Linear Module**: Exploits the correlation of spatial features across all frequencies to model the spatial features of the target speech more accurately.

### Experimental Results

Experiments were conducted on multiple simulated and real datasets, and the results show that:

1. SpatialNet achieves state-of-the-art performance on almost all tasks.
2. SpatialNet suffers little from the spectral generalization problem.
3. Attention maps show that SpatialNet is indeed performing speaker clustering.

### Conclusion

SpatialNet performs multichannel speech separation, denoising, and dereverberation by combining narrow-band and cross-band spatial information. The experimental results support the effectiveness of this approach.
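As a rough illustration of the Method Overview, the NumPy sketch below shows which axis each kind of block operates along on an STFT-domain feature tensor of shape (frequencies, frames, hidden size). It is not the authors' implementation. The function names, the single attention head, the depthwise kernels, the single F-by-F mixing matrix, and the omission of normalization and activation layers are all simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def narrow_band_attention(h, wq, wk, wv):
    """Self-attention over time, run separately for every frequency.

    h: (F, T, D) features. wq, wk, wv: (D, D) weights shared by all
    frequencies (single head here; an assumption for brevity).
    """
    q, k, v = h @ wq, h @ wk, h @ wv                       # each (F, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(h.shape[-1])  # (F, T, T)
    attn = softmax(scores, axis=-1)                        # frame-to-frame weights
    return h + attn @ v, attn                              # residual connection

def conv_along_axis1(h, kernel):
    """Depthwise 1-D convolution along axis 1 with zero padding.

    h: (N, L, D). kernel: (K, D) with odd K, shared across axis 0.
    """
    k = kernel.shape[0]
    pad = k // 2
    hp = np.pad(h, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(h)
    for i in range(k):
        out += hp[:, i:i + h.shape[1], :] * kernel[i]
    return h + out                                         # residual connection

def temporal_conv(h, kernel):
    """Narrow-band: smooth/filter along time within each frequency."""
    return conv_along_axis1(h, kernel)

def frequency_conv(h, kernel):
    """Cross-band: correlate adjacent frequencies within each frame."""
    return conv_along_axis1(h.transpose(1, 0, 2), kernel).transpose(1, 0, 2)

def full_band_linear(h, w):
    """Cross-band: mix all frequencies within each frame.

    w: (F, F) matrix shared by all frames and hidden units (a simplification).
    """
    return h + np.einsum("gf,ftd->gtd", w, h)

# Toy sizes and random weights, chosen only for the demonstration.
F, T, D = 5, 8, 4
h = rng.standard_normal((F, T, D))
wq, wk, wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
kt = rng.standard_normal((3, D)) * 0.1   # temporal kernel
kf = rng.standard_normal((3, D)) * 0.1   # frequency kernel
wf = rng.standard_normal((F, F)) * 0.1   # full-band mixing matrix

x = full_band_linear(frequency_conv(h, kf), wf)   # cross-band block
x, attn = narrow_band_attention(x, wq, wk, wv)    # narrow-band block
x = temporal_conv(x, kt)
print(x.shape, attn.shape)
```

In this sketch the narrow-band functions never mix information between frequencies, and the cross-band functions never mix information between frames. The interleaving described above relies on this division of labor. The per-frequency `attn` array plays the role of the attention maps that the paper inspects for speaker clustering.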

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation
