Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other interfering speakers. We focus our discussion on a semisupervised mode that separates the speech of the target speaker from that of an unknown interfering speaker, which is more flexible than the conventional supervised mode that requires known information about both the target and interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, the features of both the target and interfering speakers, which is shown to generalize better than an architecture that outputs only the features of the target speaker. Second, we propose using a set of multiple DNNs, each intended to be signal-noise-dependent (SND), because a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that our proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, another 12.1% WER reduction over our best speaker-independent ASR system is achieved.
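The two ideas named in the abstract, a dual-output regression DNN and a set of SNR-dependent DNNs, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class and function names (`DualOutputDNN`, `dual_mse`, `pick_snd_model`), the single sigmoid hidden layer, the layer sizes, the mean-squared-error objective over both outputs, and the nearest-SNR selection rule are all hypothetical, since the abstract specifies none of them.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class DualOutputDNN:
    """Hypothetical regression net: mixture features in,
    concatenated [target | interferer] features out."""

    def __init__(self, n_in, n_hidden, n_feat, seed=0):
        rng = random.Random(seed)
        self.n_feat = n_feat
        self.hidden = init_layer(n_in, n_hidden, rng)
        # Linear output layer with 2 * n_feat units: one half per speaker.
        self.output = init_layer(n_hidden, 2 * n_feat, rng)

    def forward(self, mixture):
        h = sigmoid(dense(mixture, *self.hidden))
        y = dense(h, *self.output)
        return y[:self.n_feat], y[self.n_feat:]


def dual_mse(pred_target, pred_interf, ref_target, ref_interf):
    """Mean squared error over both outputs (assumed training objective)."""
    pred = pred_target + pred_interf
    ref = ref_target + ref_interf
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)


def pick_snd_model(models_by_snr, estimated_snr_db):
    """Assumed selection rule: use the model trained at the nearest SNR."""
    nearest = min(models_by_snr, key=lambda snr: abs(snr - estimated_snr_db))
    return models_by_snr[nearest]


# Usage with untrained weights and made-up feature values.
net = DualOutputDNN(n_in=6, n_hidden=8, n_feat=3)
target, interferer = net.forward([0.1, -0.2, 0.3, 0.0, 0.5, -0.4])
loss = dual_mse(target, interferer, [0.0] * 3, [0.0] * 3)
chosen = pick_snd_model({-6: "dnn_low", 0: "dnn_mid", 6: "dnn_high"}, 4.0)
```

In this sketch the second output half only adds units to the last layer, so a target-only network is the special case that drops the interferer half from both the output and the loss. How the mixing SNR would be estimated before choosing an SND model is left open here.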

Dual-Path RNN for Long Recording Speech Separation

Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation

Dual-Path Modeling for Long Recording Speech Separation in Meetings

DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement

Efficient Monaural Speech Separation with Multiscale Time-Delay Sampling

Embedding Recurrent Layers with Dual-Path Strategy in a Variant of Convolutional Network for Speaker-Independent Speech Separation

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

On the Design and Training Strategies for RNN-based Online Neural Speech Separation Systems

La Furca: Iterative Context-Aware End-to-End Monaural Speech Separation Based on Dual-Path Deep Parallel Inter-Intra Bi-LSTM with Attention

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM

Dual-Path Modeling with Memory Embedding Model for Continuous Speech Separation

Multi-channel Conversational Speaker Separation via Neural Diarization

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

LaFurca: Iterative Multi-Stage Refined End-to-End Monaural Speech Separation Based on Context-Aware Dual-Path Deep Parallel Inter-Intra Bi-LSTM

Deep Encoder/decoder Dual-Path Neural Network for Speech Separation in Noisy Reverberation Environments

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Continuous speech separation using speaker inventory for long recording

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation