Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) that directly models the highly nonlinear relationship between the speech features of a mixed signal and those of the target speaker and the interfering speakers it contains. We focus our discussion on a semisupervised mode that separates the speech of the target speaker from an unknown interfering speaker, which is more flexible than the conventional supervised mode, where information about both the target and interfering speakers is known. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, the features of both the target and interfering speakers, which is shown to generalize better than an architecture that outputs only the features of the target speaker. Second, we propose using a set of multiple DNNs, each designed to be signal-noise-dependent (SND), because a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that our proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low-SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, another 12.1% WER reduction over our best speaker-independent ASR system is achieved.
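The dual-output idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the single sigmoid hidden layer, the layer sizes, the random initialization, the summed mean-squared-error objective, and the function names `init_dual_output_dnn`, `forward`, and `dual_output_mse`. The only point it shows is that one network maps mixture features to the features of both speakers at once, with the output layer split into a target half and an interferer half.

```python
import numpy as np


def init_dual_output_dnn(feat_dim, hidden_dim, rng):
    """Random weights for a one-hidden-layer regression network (illustrative sizes)."""
    return {
        "W1": rng.standard_normal((feat_dim, hidden_dim)) * 0.1,
        "b1": np.zeros(hidden_dim),
        # The output layer is twice as wide as the input:
        # target-speaker features followed by interferer features.
        "W2": rng.standard_normal((hidden_dim, 2 * feat_dim)) * 0.1,
        "b2": np.zeros(2 * feat_dim),
    }


def forward(params, mixture_feats):
    """Map mixture features (frames x feat_dim) to two feature estimates."""
    pre_activation = mixture_feats @ params["W1"] + params["b1"]
    hidden = 1.0 / (1.0 + np.exp(-pre_activation))  # sigmoid (assumed)
    out = hidden @ params["W2"] + params["b2"]      # linear regression output
    feat_dim = out.shape[1] // 2
    return out[:, :feat_dim], out[:, feat_dim:]


def dual_output_mse(params, mixture_feats, target_feats, interferer_feats):
    """Sum of the two mean-squared errors (assumed training objective)."""
    est_target, est_interferer = forward(params, mixture_feats)
    target_err = np.mean((est_target - target_feats) ** 2)
    interferer_err = np.mean((est_interferer - interferer_feats) ** 2)
    return target_err + interferer_err
```

A single-output baseline would keep only the first half of the output layer and the first error term. In this sketch the second term is what forces the hidden layer to represent the interfering speaker as well.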

Dynamic Slimmable Network for Speech Separation

Low-Latency Deep Clustering For Speech Separation

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

On Data Sampling Strategies for Training Neural Network Speech Separation Models

Towards Real-Time Single-Channel Speech Separation in Noisy and Reverberant Environments

Scaling strategies for on-device low-complexity source separation with Conv-Tasnet

Efficient time-domain speech separation using short encoded sequence network

Rethinking the Separation Layers in Speech Separation Networks

Ultra Low Complexity Deep Learning Based Noise Suppression

Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios

A Neural State-Space Modeling Approach to Efficient Speech Separation

End-to-end Networks for Supervised Single-channel Speech Separation

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks.

An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention

Glmsnet: single channel speech separation framework in noisy and reverberant environments

Efficient, Cluster-Informed, Deep Speech Separation with Cross-Cluster Information in AD-HOC Wireless Acoustic Sensor Networks

A Neural State-Space Model Approach to Efficient Speech Separation

Dynamic Slimmable Denoising Network

Multi-layer encoder-decoder time-domain single channel speech separation

Online Binaural Speech Separation of Moving Speakers With a Wavesplit Network

Small-footprint slimmable networks for keyword spotting