Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation or dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed from the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge.
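As a rough illustration of two ideas mentioned in the abstract, the sketch below shows (a) how the RI components of a complex spectrogram might be stacked into a real-valued feature tensor, and (b) how an SI-SDR improvement can be computed. It is a minimal NumPy sketch, not the authors' implementation: the function names `stack_ri`, `si_sdr`, and `si_sdr_improvement`, the `(frames, freqs)` layout, the mean removal, and the `eps` stabilizer are all assumptions chosen for illustration.

```python
import numpy as np


def stack_ri(spec):
    """Stack real and imaginary parts of a complex spectrogram.

    spec: complex array of shape (frames, freqs) (assumed layout).
    Returns a real array of shape (2, frames, freqs).
    """
    return np.stack([spec.real, spec.imag], axis=0)


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals.

    Both signals are mean-removed first (a common convention, assumed here).
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Under these assumptions, a figure such as the abstract's 23.5 dB would correspond to `si_sdr_improvement` averaged over the separated speakers and test utterances; the averaging and any permutation handling are not shown here.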

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation