Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) that directly models the highly nonlinear relationship between the speech features of a mixed signal and those of the target speaker and interfering speakers it contains. We focus on a semisupervised mode, which separates the speech of the target speaker from an unknown interfering speaker and is more flexible than the conventional supervised mode, where information about both the target and interfering speakers is known. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, the features of both the target and interfering speakers, which is shown to generalize better than an architecture that outputs the features of the target speaker only. Second, we propose using a set of multiple DNNs, each intended to be signal-noise-dependent (SND), because a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that the proposed framework achieves better separation results than other conventional approaches in either a supervised or a semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low-SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, a further 12.1% WER reduction over our best speaker-independent ASR system is achieved.
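
The two ideas in the abstract, a dual-output network and a set of SNR-specific networks, can be illustrated with a minimal sketch. It is not the authors' implementation: the single sigmoid hidden layer, the mean-squared-error objective, the pure-Python arithmetic, the per-frame interface, and the nearest-SNR selection rule are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def init_dual_output_dnn(in_dim, hidden_dim, feat_dim, seed=0):
    """Hypothetical model: the output layer is twice as wide as one speaker's features."""
    rng = random.Random(seed)
    return {
        "hidden": init_layer(in_dim, hidden_dim, rng),
        "output": init_layer(hidden_dim, 2 * feat_dim, rng),
        "feat_dim": feat_dim,
    }


def forward(model, mixed_frame):
    """Map one frame of mixture features to (target, interferer) feature estimates."""
    hidden = [1.0 / (1.0 + math.exp(-a)) for a in affine(model["hidden"], mixed_frame)]
    out = affine(model["output"], hidden)
    d = model["feat_dim"]
    return out[:d], out[d:]


def dual_output_loss(model, mixed_frame, target_ref, interferer_ref):
    """Assumed objective: mean squared error over both output halves jointly."""
    target_est, interferer_est = forward(model, mixed_frame)
    estimates = target_est + interferer_est
    references = list(target_ref) + list(interferer_ref)
    return sum((e - r) ** 2 for e, r in zip(estimates, references)) / len(estimates)


def pick_snd_model(models_by_snr_db, estimated_snr_db):
    """Assumed selection rule: use the model trained at the nearest SNR level."""
    nearest = min(models_by_snr_db, key=lambda level: abs(level - estimated_snr_db))
    return models_by_snr_db[nearest]
```

In this sketch, training on both halves of the output forces the hidden layer to account for the interferer as well as the target, which is one plausible reading of why the dual-output design generalizes better. How the SNR of a test mixture would be estimated before calling `pick_snd_model` is left open here.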
