Abstract: While recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has achieved significant improvement, performance on complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is first converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers, as they are more suitable for speech separation. Each beamformed signal is then fed into a single-channel anchored deep attractor network to generate separated signals. The final separation is obtained by post-selecting the separation output for each beam. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the proposed system largely improves the state of the art in speech separation, achieving 11.5 dB, 11.76 dB and 11.02 dB average signal-to-distortion ratio improvement for 4, 3 and 2 overlapped speaker mixtures, respectively, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean-trained acoustic model on the separated speech, achieving relative word error rate (WER) reductions of 45.76%, 59.40% and 62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a far-talk acoustic model, the WER is further reduced.
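
As a rough illustration of the fixed-beamforming stage and of the reported metric, the sketch below implements a two-microphone first-order differential beam and a simplified SDR improvement. Everything here is an assumption made for illustration and not the paper's implementation: the function names, the integer-sample delay, the two-microphone geometry, and the plain energy-ratio SDR (standard SDR toolkits allow for additional distortion terms). The paper's actual beam patterns and evaluation procedure may differ.

```python
import numpy as np


def delay(signal, n_samples):
    """Delay a 1-D signal by n_samples, zero-padding the start (hypothetical helper)."""
    if n_samples == 0:
        return signal.copy()
    return np.concatenate([np.zeros(n_samples), signal[:-n_samples]])


def differential_beam(x_front, x_rear, delay_samples):
    """First-order differential beam for two microphones (illustrative).

    Subtracting a delayed copy of the rear microphone places a null on
    sources that reach the rear microphone delay_samples before the front one.
    """
    return x_front - delay(x_rear, delay_samples)


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over residual energy."""
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(residual ** 2) + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(16000)
    # A source arriving from the rear hits the rear mic 3 samples early.
    nulled = differential_beam(delay(source, 3), source, 3)
    print("residual energy in null direction:", float(np.sum(nulled ** 2)))
```

In this toy setup a bank of such beams with different delays would play the role of the fixed beam patterns, with each beam output passed to a single-channel separator.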

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge

Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Cracking the cocktail party problem by multi-beam deep attractor network

Acoustic Modeling for Multi-Array Conversational Speech Recognition in the Chime-6 Challenge

Single-Channel Multi-Speaker Separation using Deep Clustering

Separating Voices from Multiple Sound Sources Using 2D Microphone Array

A Multi-channel Speech Separation System for Unknown Number of Multiple Speakers

Beamforming and Deep Models Integrated Multi-talker Speech Separation

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Acoustic Model Ensembling Using Effective Data Augmentation for CHiME-5 Challenge

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

Speaker and Direction Inferred Dual-channel Speech Separation

Space-and-speaker-aware Acoustic Modeling with Effective Data Augmentation for Recognition of Multi-Array Conversational Speech

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies