Abstract: Speech separation systems typically operate in the time-frequency (T-F) domain, enhancing the magnitude response and leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading many researchers to enhance both the magnitude and phase spectra. The complex-valued Fourier spectrum and the real-valued shifted real spectrum (SRS) of a discrete-time speech signal both carry magnitude and phase information, from which the speech signal can be reconstructed without loss. In this paper, we propose two novel methods, pcIRM and pSRSM, to solve the problem of speaker-independent monaural source separation and to recover the phase of the source speech signals. Specifically, pcIRM estimates the complex ideal ratio mask (cIRM) from the complex-valued phase-encoded Fourier spectrogram, and pSRSM estimates the shifted real spectrum mask (SRSM) from the real-valued phase-encoded shifted real spectrum; both are implemented with deep bidirectional Long Short-Term Memory networks. Furthermore, combining these mask estimation methods with the utterance-level permutation invariant training (uPIT) approach is shown to be an effective way to improve speech separation. uPIT addresses the label ambiguity problem, which is the major barrier to speaker-independent multi-talker source separation. Finally, we report separation results for the proposed methods on the WSJ0-2mix dataset and compare them with those of state-of-the-art methods. Extensive experimental results demonstrate the advantages of the proposed pcIRM method in terms of the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ) metrics, while the pSRSM method achieves performance comparable to pcIRM when separating speakers of opposite gender, with lower model complexity.
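
To make the two central ideas of the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes an oracle complex ratio mask for each source (the standard cIRM definition, the source spectrum divided by the mixture spectrum) and then evaluates an utterance-level permutation invariant loss by trying every assignment of output streams to targets. The function names `ideal_cirm` and `upit_loss`, the `eps` stabilizer, and the use of random complex arrays in place of real STFT spectrograms are all assumptions made for illustration; the phase-encoded input features and the BLSTM estimator described in the abstract are not modeled here.

```python
import itertools

import numpy as np


def ideal_cirm(source_spec, mixture_spec, eps=1e-8):
    """Oracle complex ratio mask M with M * mixture ~= source in each T-F bin.

    Computed as S * conj(Y) / (|Y|^2 + eps); eps is an illustrative
    stabilizer for near-silent mixture bins (an assumption of this sketch).
    """
    return source_spec * np.conj(mixture_spec) / (np.abs(mixture_spec) ** 2 + eps)


def upit_loss(estimates, targets):
    """Utterance-level permutation invariant mean squared error.

    estimates, targets: complex or real arrays of shape
    (num_speakers, frames, bins). One permutation is chosen for the
    whole utterance, which is what distinguishes uPIT from frame-level PIT.
    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    output stream assigned to target i.
    """
    num_speakers = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # Reorder the output streams and average the error over all frames.
        loss = np.mean(np.abs(estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random complex "spectrograms" standing in for two speakers' STFTs.
    shape = (2, 50, 129)  # (speakers, frames, frequency bins)
    sources = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    mixture = sources.sum(axis=0)

    masks = ideal_cirm(sources, mixture)
    separated = masks * mixture  # masking recovers magnitude and phase

    # Present the output streams in swapped order, as a network might.
    loss, perm = upit_loss(separated[::-1], sources)
    print("best permutation:", perm, "loss:", loss)
```

Because the mask is complex-valued, multiplying it with the mixture corrects both magnitude and phase, which is the motivation the abstract gives for moving beyond magnitude-only masks. The exhaustive permutation search costs a factorial number of loss evaluations in the number of speakers, which is negligible for the two-speaker WSJ0-2mix setting.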

Speech Separation Based on Sound Localization and Auditory Masking Effect

Localization Based Stereo Speech Source Separation Using Probabilistic Time-Frequency Masking and Deep Neural Networks

Localization Based Stereo Speech Separation Using Deep Networks

Reverberant Speech Separation with Probabilistic Time-Frequency Masking for B-format Recordings

A Blind Separation Algorithm of Speech Mixtures Base on Time-Frequency Masking

Time-Frequency Masking for Speech Separation and Its Potential for Hearing Aid Design

A Blind Separation Algorithm of Speech Signal Based on Spatial Cues

Real-time binaural speech separation with preserved spatial cues

Underdetermined Blind Separation of Delayed Sound Source in the Time-frequency Domain

Cepstral Smoothing of Spectral Masks for Acoustic Vector-Sensor Based Convolutive Speech Separation

Incorporating Phase-Encoded Spectrum Masking into Speaker-Independent Monaural Source Separation

Experiments on Blind Speech Separations

A Blind Separation Method of Instantaneous Speech Signal Via Independent Components Analysis

Speech Enhancement based on Human Auditory Masking Properties under Non-stationary Environments

Speech separation based on reliable binaural cues with two-stage neural network in noisy-reverberant environments

Combined Manipulations of the Perceived Location and Spatial Extent of the Speech-Target Image Predominantly Affect Speech-on-speech Masking

Dual-Channel Speech Separation Using Interaural Time Difference with Generalized Gaussian Mixture Model

An Improved Method for Speech Enhancement Based on Human Auditory Masking Properties

Speaker and Direction Inferred Dual-channel Speech Separation

The Effect of Perceived Spatial Separation on Informational Masking of Chinese Speech

Synergistic Optimization Based Binaural Time-Frequency Masking for Speech Source Localization