Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other interfering speakers. We focus our discussion on a semisupervised mode that separates the speech of the target speaker from that of an unknown interfering speaker, which is more flexible than the conventional supervised mode requiring known information about both the target and interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, covering the features of both the target and interfering speakers, which is shown to generalize better than an architecture that outputs only the features of the target speaker. Second, we propose using a set of multiple DNNs, each designed to be signal-noise-dependent (SND), to address the difficulty that a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that our proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system in which a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models needs to be implemented. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, another 12.1% WER reduction over our best speaker-independent ASR system is achieved.
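The two ideas named in the abstract, a dual-output regression network and a set of SNR-specific networks, can be illustrated with a minimal sketch. This is not the authors' implementation: the single sigmoid hidden layer, the layer sizes, the names `DualOutputDNN`, `dual_mse`, and `select_snd_model`, and the nearest-SNR selection rule are all assumptions made for illustration, and the abstract does not say how the SNR of a mixture is obtained.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one affine layer with small random weights."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def affine(layer, x):
    """Compute W x + b for a single input vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


class DualOutputDNN:
    """Hypothetical regression net: mixed-signal features in, two feature vectors out."""

    def __init__(self, n_in, n_hidden, n_feat, seed=0):
        rng = random.Random(seed)
        self.n_feat = n_feat
        self.hidden = init_layer(n_in, n_hidden, rng)
        # One linear output layer holds both speakers' features side by side.
        self.out = init_layer(n_hidden, 2 * n_feat, rng)

    def forward(self, mixed):
        h = sigmoid(affine(self.hidden, mixed))
        y = affine(self.out, h)
        # First half: target-speaker estimate; second half: interferer estimate.
        return y[: self.n_feat], y[self.n_feat :]


def dual_mse(pred_t, pred_i, ref_t, ref_i):
    """Mean squared error taken jointly over target and interferer outputs."""
    err = [(p - r) ** 2 for p, r in zip(pred_t + pred_i, ref_t + ref_i)]
    return sum(err) / len(err)


def select_snd_model(models, snr_db):
    """Assumed rule: pick the model whose nominal SNR (dict key) is closest."""
    return models[min(models, key=lambda s: abs(s - snr_db))]
```

A training loop would minimize `dual_mse` over pairs of mixed and clean features, once per SNR condition, and `select_snd_model` would then route a test mixture to one of the resulting networks.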

A Theory on Deep Neural Network Based Vector-to-Vector Regression With an Illustration of Its Expressive Power in Speech Enhancement

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

A regression approach to speech enhancement based on deep neural networks

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

Auxiliary Features from Laser-Doppler Vibrometer Sensor for Deep Neural Network Based Robust Speech Recognition

A Unified Speaker-Dependent Speech Separation and Enhancement System Based on Deep Neural Networks

Tensor-to-Vector Regression for Multi-channel Speech Enhancement based on Tensor-Train Network

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

A Maximum Likelihood Approach to Deep Neural Network Based Speech Dereverberation

Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Independent Vector Analysis with Deep Neural Network Source Priors

Structured Discriminative Models Using Deep Neural-Network Features

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

On Mean Absolute Error for Deep Neural Network Based Vector-to-Vector Regression

Towards speech enhancement using a variational U-Net architecture

A variance modeling framework based on variational autoencoders for speech enhancement

Deep neural networks based speaker modeling at different levels of phonetic granularity

An Analysis of the Expressiveness of Deep Neural Network Architectures Based on Their Lipschitz Constants

Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU