Abstract: Speech separation is key to many back-end speech tasks, such as multi-speaker speech recognition. In recent years, aided by advances in deep learning, many single-channel speech separation models have performed well in weakly reverberant environments. When reverberation is present, however, multi-channel speech separation models still hold a clear advantage. Among them, deep neural network (DNN) based beamformers (also known as neural beamformers) have achieved significant improvements in separation quality. Current neural beamformers cannot jointly optimize the beamforming layers and the DNN layers when they use prior knowledge from existing beamforming algorithms, which may prevent the model from reaching optimal separation performance. To solve this problem, this paper employs a set of beamformers that uniformly sample the space as a learnable module in the neural network, with their coefficients initialized from the existing maximum directivity factor (DF) beamformer. Furthermore, a cross-attention mechanism is introduced to obtain beam representations of the source signals when their directions are unknown. Experimental results show that, in separation tasks with reverberation, the proposed method outperforms the current state-of-the-art temporal neural beamformer, the filter-and-sum network (FasNet), and several mainstream multi-channel speech separation approaches in terms of scale-invariant signal-to-noise ratio (SI-SNR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI).
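To make the initialization idea concrete, the following is a minimal sketch of how coefficients for a bank of maximum-DF (superdirective) beamformers could be computed for uniformly sampled look directions. It is not the paper's implementation: the abstract does not give the array geometry, the processing domain, or any hyperparameters, so the uniform linear array, the single narrowband frequency, the diagonal loading, and the name `max_df_weights` are all assumptions made for illustration. It uses the textbook closed form h = Γ⁻¹d / (dᴴΓ⁻¹d), where Γ is the pseudo-coherence matrix of spherically isotropic noise and d is the steering vector.

```python
import numpy as np

def max_df_weights(num_mics=4, spacing=0.05, freq=1000.0,
                   num_beams=8, c=343.0, loading=1e-3):
    """Illustrative maximum-DF weights for a uniform linear array.

    All parameters are hypothetical defaults, not values from the paper.
    Returns (weights, steering), each of shape (num_beams, num_mics).
    """
    positions = np.arange(num_mics) * spacing          # mic positions in metres
    dist = np.abs(positions[:, None] - positions[None, :])

    # Pseudo-coherence of spherically isotropic noise:
    # sin(2*pi*f*d/c) / (2*pi*f*d/c); np.sinc(x) is sin(pi*x)/(pi*x).
    gamma = np.sinc(2.0 * freq * dist / c)
    gamma = gamma + loading * np.eye(num_mics)         # assumed regularization

    # Look directions sampled uniformly over [0, pi] (assumed sampling grid).
    angles = np.linspace(0.0, np.pi, num_beams)

    weights = np.empty((num_beams, num_mics), dtype=complex)
    steering = np.empty((num_beams, num_mics), dtype=complex)
    for b, theta in enumerate(angles):
        delays = positions * np.cos(theta) / c         # far-field plane-wave delays
        d = np.exp(-2j * np.pi * freq * delays)        # steering vector
        g_inv_d = np.linalg.solve(gamma, d)            # Gamma^{-1} d
        weights[b] = g_inv_d / np.vdot(d, g_inv_d)     # divide by d^H Gamma^{-1} d
        steering[b] = d
    return weights, steering

weights, steering = max_df_weights()
# Each beam passes its own look direction undistorted: h^H d = 1.
response = np.sum(weights.conj() * steering, axis=1)
```

In a trainable model, weights like these would serve only as initial values for a learnable layer, so that later gradient updates can adapt them jointly with the rest of the network.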
