Abstract: Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberation is present in the mixtures. In this paper, we propose a new stereo speech separation system in which deep neural networks are used to generate a soft T-F mask for separation. More specifically, the deep neural network, which is composed of two sparse autoencoders and a softmax regression layer, is used to estimate the orientation of the dominant source at each T-F unit, based on low-level features such as the mixing vector (MV) and the interaural level and phase differences (ILD/IPD). The dataset for training the networks was generated by convolving binaural room impulse responses (RIRs) with clean speech signals positioned at different angles with respect to the sensors. With the training dataset, we use unsupervised learning to extract high-level features from the low-level features, and supervised learning to find the nonlinear functions between the high-level features and the orientation of the dominant source. Using the trained networks, the probability that each T-F unit belongs to each source (target or interferer) can be estimated from the localization cues, and this probability is then used to generate the soft mask for source separation. Experiments based on real binaural RIRs and the TIMIT dataset show the performance of the proposed system for reverberant speech mixtures, as compared with a recently proposed model-based T-F masking technique.
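
As an illustration of the quantities named in the abstract, the following minimal sketch computes the ILD and IPD cues from a pair of stereo spectrograms and turns per-unit source scores into a soft mask. It is not the authors' implementation: the function names (`ild_ipd`, `softmax`, `apply_soft_mask`), the `eps` constant, and the array layout (frequency, time, source) are assumptions made for illustration, and the scores stand in for whatever the trained network would output.

```python
import numpy as np

def ild_ipd(left, right, eps=1e-12):
    """Per T-F unit interaural level difference (dB) and phase difference (rad).

    left, right: complex STFT arrays of identical shape (frequency, time).
    eps is a hypothetical small constant that avoids division by zero.
    """
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    ipd = np.angle(left * np.conj(right))  # wrapped to (-pi, pi]
    return ild, ipd

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def apply_soft_mask(mixture, scores, source_index):
    """Weight each T-F unit of the mixture by the probability of one source.

    mixture: complex STFT, shape (frequency, time).
    scores: real array, shape (frequency, time, n_sources); assumed here to be
            the unnormalised per-source outputs of a classifier.
    """
    probs = softmax(scores, axis=-1)  # probabilities sum to 1 over sources
    return probs[..., source_index] * mixture

# Toy usage with two T-F units and two sources.
left = np.array([[2.0 + 0.0j, 0.0 + 1.0j]])
right = np.array([[1.0 + 0.0j, 1.0 + 0.0j]])
ild, ipd = ild_ipd(left, right)

scores = np.zeros((1, 2, 2))  # equal scores, so each source gets weight 0.5
target = apply_soft_mask(left, scores, source_index=0)
```

Because the mask values are probabilities rather than hard 0/1 decisions, the masks of all sources sum to one at every T-F unit, so the masked spectrograms add back up to the mixture.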

Deep Learning Based Binaural Speech Separation in Reverberant Environments

Binaural Reverberant Speech Separation Based on Deep Neural Networks

Localization Based Stereo Speech Separation Using Deep Networks

Localization Based Stereo Speech Source Separation Using Probabilistic Time-Frequency Masking and Deep Neural Networks

Research on Speech Separation Technology Based on Deep Learning

Binaural Classification for Reverberant Speech Segregation Using Deep Neural Networks

Deep Encoder/decoder Dual-Path Neural Network for Speech Separation in Noisy Reverberation Environments

Deep Neural Network Based Audio Source Separation

Listening and Grouping: an Online Autoregressive Approach for Monaural Speech Separation

Real-time binaural speech separation with preserved spatial cues

Speech separation based on reliable binaural cues with two-stage neural network in noisy-reverberant environments

Supervised Speech Separation Based on Deep Learning: An Overview

Deep Learning Based Speech Separation Via NMF-Style Reconstructions

Boosting Spatial Information for Deep Learning Based Multichannel Speaker-Independent Speech Separation in Reverberant Environments

Beamformed Feature for Learning-based Dual-channel Speech Separation

Deep Learning for Binaural Sound Source Localization with Low Signal-to-noise Ratio

Applications of Deep Learning in Supervised Speech Separation

Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation

A Deep Ensemble Learning Method for Monaural Speech Separation

A Multichannel Learning-Based Approach for Sound Source Separation in Reverberant Environments

Multi-Target Ensemble Learning for Monaural Speech Separation