Abstract: Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is challenging, especially when the mixtures contain room reverberation. In this paper, we propose a new stereo speech separation system in which deep neural networks are used to generate a soft T-F mask for separation. More specifically, a deep neural network composed of two sparse autoencoders and a softmax regression estimates the orientation of the dominant source at each T-F unit from low-level features such as the mixing vector (MV) and the interaural level and phase differences (ILD/IPD). The training dataset was generated by convolving binaural room impulse responses (RIRs) with clean speech signals positioned at different angles with respect to the sensors. With this dataset, we use unsupervised learning to extract high-level features from the low-level features, and supervised learning to find the nonlinear mapping between the high-level features and the orientation of the dominant source. Using the trained networks, the probability that each T-F unit belongs to each source (target or interferer) is estimated from the localization cues, and these probabilities are then used to generate the soft mask for source separation. Experiments based on real binaural RIRs and the TIMIT dataset show the performance of the proposed system on reverberant speech mixtures, compared with a recently proposed model-based T-F masking technique.
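
The low-level cues and the soft mask described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the per-unit orientation probabilities stand in for the output of the trained network, and forming the mask by summing the probabilities of the orientation classes assigned to the target is an assumption made for illustration.

```python
import numpy as np


def interaural_cues(left_stft, right_stft, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for every T-F unit.

    left_stft, right_stft: complex arrays of identical shape
    (frequency bins x time frames). eps guards against division by zero.
    """
    ratio = (left_stft + eps) / (right_stft + eps)
    ild = 20.0 * np.log10(np.abs(ratio))  # interaural level difference
    ipd = np.angle(ratio)                 # interaural phase difference, in (-pi, pi]
    return ild, ipd


def soft_mask(orientation_probs, target_classes):
    """Build a soft mask from per-unit orientation probabilities.

    orientation_probs: real array (freq x time x classes) whose last axis
    sums to 1, e.g. a softmax output (assumed to come from a trained model).
    target_classes: indices of the orientation classes assigned to the target.
    """
    return orientation_probs[..., target_classes].sum(axis=-1)


def apply_mask(mask, left_stft, right_stft):
    """Weight both channels of the mixture by the same real-valued mask."""
    return mask * left_stft, mask * right_stft
```

Because the class probabilities sum to one at each T-F unit, the masks obtained for the target classes and for the remaining classes also sum to one, so the mixture energy is shared between the estimated sources rather than assigned to a single one as in binary masking.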

Cepstral Smoothing of Spectral Masks for Acoustic Vector-Sensor Based Convolutive Speech Separation

Acoustic vector sensor based reverberant speech separation with probabilistic time-frequency masking

Reverberant Speech Separation with Probabilistic Time-Frequency Masking for B-format Recordings

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Cepstral smoothing of masks for single-channel speech segregation

Acoustic vector sensor based speech source separation with mixed Gaussian-Laplacian distributions

Adaptive Beamforming Based on Interference-Plus-Noise Covariance Matrix Reconstruction for Speech Separation

Localization Based Stereo Speech Separation Using Deep Networks

Localization Based Stereo Speech Source Separation Using Probabilistic Time-Frequency Masking and Deep Neural Networks

Independent Vector Analysis Assisted Adaptive Beamforming for Speech Source Separation with an Acoustic Vector Sensor

Incorporation of a modified temporal cepstrum smoothing in both signal-to-noise ratio and speech presence probability estimation for speech enhancement

Speech enhancement based on estimating expected values of speech cepstra

Speech Enhancement Algorithm Based on Auditory Masking Effect and Optimal Smoothing

A New Method of Solving Permutation Problem in Blind Source Separation for Convolutive Acoustic Signals in Frequency-domain

Quasi-Blind Source Separation Algorithm for Convolutive Mixture of Speech

A Blind Speech Separation Algorithm with Strong Reverberation

A Dual Microphone Speech Enhancement Method With A Smoothing Parameter Mask

A Blind Separation Algorithm of Speech Mixtures Base on Time-Frequency Masking

Improving Separation of Harmonic Sources with Iterative Estimation of Spatial Cues

Using an Adjustment Training and a Smoothing Mask for Speech Segregation

Expectation-maximisation for Speech Source Separation Using Convolutive Transfer Function