Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) that directly models the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other interfering speakers. We focus on a semisupervised mode that separates the speech of the target speaker from an unknown interfering speaker, which is more flexible than the conventional supervised mode, where information about both the target and interfering speakers is known. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, the features of both the target and the interfering speakers, which is shown to generalize better than an architecture that outputs only the target speaker's features. Second, we propose a set of multiple DNNs, each intended to be signal-noise-dependent (SND), to address the difficulty that a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that the proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low-SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are applied to the enhanced signals, a further 12.1% WER reduction over our best speaker-independent ASR system is achieved.
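
The dual-output idea described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a plain feedforward regression network whose single linear output layer is split into a target-speaker half and an interferer half, trained with a summed mean-squared error. The layer sizes, the sigmoid activation, the equal weighting of the two loss terms, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def init_dual_output_dnn(in_dim, hidden_dim, feat_dim, n_hidden=2, seed=0):
    """Random weights for a feedforward net whose output has 2 * feat_dim units."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden_dim] * n_hidden + [2 * feat_dim]
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, mixed):
    """Map mixed-signal features to (target estimate, interferer estimate)."""
    h = mixed
    for W, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden layers (assumed)
    W, b = params[-1]
    out = h @ W + b                             # linear regression output
    feat_dim = out.shape[-1] // 2
    return out[..., :feat_dim], out[..., feat_dim:]

def dual_mse(params, mixed, target, interferer):
    """Sum of the two mean-squared errors, equally weighted (an assumption)."""
    t_hat, i_hat = forward(params, mixed)
    return float(np.mean((t_hat - target) ** 2)
                 + np.mean((i_hat - interferer) ** 2))

# Toy usage with made-up dimensions: 4 frames, 257 spectral bins, 7-frame context.
feat_dim, context = 257, 7
params = init_dual_output_dnn(context * feat_dim, hidden_dim=64, feat_dim=feat_dim)
mixed = np.zeros((4, context * feat_dim))
t_hat, i_hat = forward(params, mixed)
loss = dual_mse(params, mixed, np.zeros((4, feat_dim)), np.zeros((4, feat_dim)))
```

Dropping the interferer half of the output layer and its loss term would give the single-output variant that the abstract compares against. Under the same assumptions, a set of SND-DNNs would amount to keeping one such parameter set per SNR condition.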

A New Neural Beamformer for Multi-channel Speech Separation

DFBNet: Deep Neural Network Based Fixed Beamformer for Multi-channel Speech Separation

Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

Neural Spatio-Temporal Beamformer for Target Speech Separation

MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

Masking-based Neural Beamformer for Multichannel Speech Enhancement

ADL-MVDR: All deep learning MVDR beamformer for target speech separation

Deep Learning Based Speech Beamforming

Dual-path Transformer Based Neural Beamformer for Target Speech Extraction

Cracking the cocktail party problem by multi-beam deep attractor network

Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Adaptive Beamforming Based on Interference-Plus-Noise Covariance Matrix Reconstruction for Speech Separation

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Deep Beamforming for Speech Enhancement and Speaker Localization with an Array Response-Aware Loss Function

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Beamforming and Deep Models Integrated Multi-talker Speech Separation

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation