Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other interfering speakers. We focus our discussion on a semisupervised mode to separate speech of the target speaker from an unknown interfering speaker, which is more flexible than the conventional supervised mode that requires known information about both the target and interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, covering the features of both the target and interfering speakers, which is shown to generalize better than an architecture that outputs features of the target speaker only. Second, we propose using a set of multiple DNNs, each designed to be signal-noise-dependent (SND), to cope with the difficulty that a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that our proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs can also yield significant performance improvements over a general DNN for speech separation in low SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system in which a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models needs to be implemented. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, another 12.1% WER reduction over our best speaker-independent ASR system is achieved.
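As a rough illustration of the dual-output idea described in the abstract, the following minimal sketch shows a feed-forward regression network whose output layer is twice the feature dimension: the first half is read as the target-speaker estimate and the second half as the interferer estimate, and the training loss covers both halves. The layer sizes, sigmoid hidden units, linear output layer, random initialization, and all function names are assumptions made for illustration only; the abstract does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); weights[j] holds the input weights of output unit j."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def dual_output_forward(layers, mixture_features, feat_dim):
    """Map mixture features to (target_estimate, interferer_estimate)."""
    h = mixture_features
    for layer in layers[:-1]:
        h = sigmoid(affine(layer, h))   # hidden layers (assumed sigmoid)
    out = affine(layers[-1], h)          # linear output layer for regression (assumed)
    return out[:feat_dim], out[feat_dim:]


def dual_mse(target_est, interf_est, target_ref, interf_ref):
    """Mean squared error taken jointly over both output halves."""
    est = target_est + interf_est
    ref = target_ref + interf_ref
    return sum((a - b) ** 2 for a, b in zip(est, ref)) / len(est)


rng = random.Random(0)
FEAT_DIM = 4   # hypothetical per-frame feature size
HIDDEN = 8     # hypothetical hidden-layer width
layers = [
    init_layer(FEAT_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, 2 * FEAT_DIM, rng),  # dual output: target + interferer
]

mixture = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]
target_est, interf_est = dual_output_forward(layers, mixture, FEAT_DIM)
loss = dual_mse(target_est, interf_est, [0.0] * FEAT_DIM, [0.0] * FEAT_DIM)
```

A single-output variant would keep only the first `FEAT_DIM` outputs and drop the interferer term from the loss; the abstract reports that the dual-output form generalizes better, and this sketch makes no attempt to reproduce that result.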

Robust Spatial Filtering Network for Separating Speech in the Direction of Interest

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters

Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain

Speaker and Direction Inferred Dual-channel Speech Separation

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention

Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-Order Latent Domain

Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation

Speech Separation Based on Signal-Noise-dependent Deep Neural Networks for Robust Speech Recognition

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

Speech separation of a target speaker based on deep neural networks

Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation

An End-to-End Speech Separation Method Based on Features of Two Domains

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Multi-channel Multi-frame ADL-MVDR for Target Speech Separation