SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Changsheng Quan, Xiaofei Li
2023-12-22
Abstract: This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks to respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use self-attention mechanism and temporal convolutional layers to respectively perform spatial-feature-based speaker clustering and temporal smoothing/filtering. The cross-band blocks process frames independently, and use full-band linear layer and frequency convolutional layers to respectively learn the correlation between all frequencies and adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves the state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network is indeed performing speaker clustering (demonstrated by attention maps).
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to design a neural network that jointly performs multichannel speech separation, denoising, and dereverberation. The proposed network, SpatialNet, extensively exploits spatial information to achieve these tasks and performs end-to-end speech enhancement in the Short-Time Fourier Transform (STFT) domain.

### Main Contributions

1. **Proposed New Cross-Band Blocks**: Compared to previous narrow-band networks, the paper introduces new cross-band blocks to better model spatial information across frequency bands.
2. **Extended Network Capabilities**: The network performs not only speech separation but also simultaneous denoising and dereverberation, and it is evaluated on each of these tasks.
3. **Superior Performance**: Experimental results show that SpatialNet achieves state-of-the-art performance on almost all tasks, whether individual or joint.

### Background and Motivation

Microphone array signal processing is widely used in hearing aids, robots, and smart home devices. Real recordings, however, are inevitably affected by environmental noise, room reverberation, and interfering speech. Suppressing these interferences can improve speech quality and the accuracy of automatic speech recognition (ASR). Traditional speech enhancement methods mainly rely on spatial information while ignoring the spectral content of the signal. This paper designs a neural network to fully utilize the spatial information in multichannel recordings, thereby achieving better speech enhancement results.

### Method Overview

SpatialNet consists of interleaved narrow-band blocks and cross-band blocks:

1. **Narrow-Band Blocks**:
   - Process each frequency independently, with all frequencies sharing the same network parameters.
   - Use Multi-Head Self-Attention (MHSA) and temporal convolution layers for spatial-feature-based speaker clustering and temporal smoothing/filtering, respectively.
   - The main functions of narrow-band blocks include:
     - **Multi-Head Self-Attention Module**: Groups components of the same speaker and separates components of different speakers by calculating the similarity of spatial vectors at the same STFT frequency.
     - **Temporal Convolutional Feed-Forward Network (T-ConvFFN)**: Inserts a temporal convolution layer between two linear layers for local smoothing/averaging operations and modeling reverberation.
2. **Cross-Band Blocks**:
   - Process each time frame independently, with all time frames sharing the same network parameters.
   - Use frequency convolution layers and full-band linear modules to learn the correlation between adjacent frequencies and between all frequencies, respectively.
   - The main functions of cross-band blocks include:
     - **Frequency Convolution Module**: Models the correlation between adjacent frequencies.
     - **Full-Band Linear Module**: Utilizes the correlation of cross-band spatial features to improve the accuracy of modeling the spatial features of the target speech.

### Experimental Results

Experiments were conducted on multiple simulated and real datasets, and the results show that:

1. SpatialNet achieves state-of-the-art performance on almost all tasks.
2. SpatialNet suffers little from the spectral generalization problem.
3. Attention maps show that SpatialNet indeed performs speaker clustering.

### Conclusion

SpatialNet achieves multichannel speech separation, denoising, and dereverberation by combining narrow-band and cross-band spatial information. Experimental results validate the effectiveness of this method.
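
### Illustrative Sketch

The difference between "processing frequencies independently" and "processing frames independently" in the Method Overview comes down to which axis is treated as a batch. The NumPy sketch below is a simplified illustration, not the authors' implementation. The function names, the `(F, T, H)` tensor layout, the single attention head, and the residual connections are all assumptions made for this example; the T-ConvFFN and frequency convolution modules are omitted.

```python
import numpy as np

def narrow_band_attention(x, wq, wk, wv):
    """Single-head self-attention over time, applied to each frequency independently.

    x: (F, T, H) hidden features (assumed layout).
    wq, wk, wv: (H, H) projection weights shared by all frequencies.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                          # each (F, T, H)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (F, T, T)
    scores -= scores.max(axis=-1, keepdims=True)              # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                  # softmax over key frames
    return x + attn @ v                                       # residual, (F, T, H)

def cross_band_linear(x, wf):
    """Full-band linear layer over frequency, applied to each frame independently.

    x: (F, T, H); wf: (F, F) weights shared by all frames
    (and, in this simplified sketch, by all hidden units).
    """
    # Every output frequency g is a weighted sum over all input frequencies f.
    return x + np.einsum("gf,fth->gth", wf, x)                # residual, (F, T, H)

# Hypothetical sizes: 5 frequencies, 7 frames, 4 hidden units.
rng = np.random.default_rng(0)
F, T, H = 5, 7, 4
x = rng.standard_normal((F, T, H))
wq, wk, wv = (rng.standard_normal((H, H)) for _ in range(3))
wf = rng.standard_normal((F, F))

# One interleaved narrow-band / cross-band step.
y = cross_band_linear(narrow_band_attention(x, wq, wk, wv), wf)
```

In the narrow-band function the frequency axis acts as a batch dimension, so information never crosses frequencies; in the cross-band function the frame axis plays that role, so information never crosses frames. Interleaving the two lets each time-frequency bin eventually draw on the whole spectrogram.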