Abstract: Typical speech separation systems operate in the time-frequency (T-F) domain, enhancing the magnitude response and leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading many researchers to consider enhancing both the magnitude and phase spectra. The complex-valued Fourier spectrum and the real-valued shifted real spectrum (SRS) of a discrete-time speech signal both carry magnitude and phase information, so either can reconstruct the speech signal without loss. In this paper, we propose two novel methods, pcIRM and pSRSM, to solve the problem of speaker-independent monaural source separation and to recover the phase of the source speech signals. Specifically, pcIRM estimates the complex ideal ratio mask (cIRM) from the complex-valued phase-encoded Fourier spectrogram, and pSRSM estimates the shifted real spectrum mask (SRSM) from the real-valued phase-encoded shifted real spectrum; both are implemented with deep bidirectional Long Short-Term Memory networks. Furthermore, combining these mask estimation methods with utterance-level permutation invariant training (uPIT) proves to be an effective way to improve speech separation. uPIT addresses the label ambiguity problem, which is the major barrier to speaker-independent multi-talker source separation. Finally, we report separation results for the proposed methods on the WSJ0-2mix dataset and compare them with those of state-of-the-art methods. Extensive experimental results demonstrate the advantages of the proposed pcIRM method in terms of the signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) metrics, while the pSRSM method achieves performance comparable to pcIRM for opposite-gender speaker separation with lower model complexity.
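To make the two standard building blocks named in the abstract concrete, the following is a minimal NumPy sketch of (a) the textbook complex ideal ratio mask and (b) an utterance-level permutation invariant loss. It is an illustration under stated assumptions, not the paper's implementation: the function names `cirm` and `upit_mse` are hypothetical, the loss is assumed to be a mean squared error, the spectrograms are random stand-ins for STFTs, and the phase-encoded input features, the SRS/SRSM variant, and the BLSTM estimator described in the abstract are not modelled here.

```python
import itertools
import numpy as np


def cirm(mixture, source, eps=1e-12):
    """Textbook complex ideal ratio mask M with M * mixture ~= source per T-F bin.

    Assumption: no mask compression or clipping is applied.
    """
    denom = mixture.real ** 2 + mixture.imag ** 2 + eps
    real = (mixture.real * source.real + mixture.imag * source.imag) / denom
    imag = (mixture.real * source.imag - mixture.imag * source.real) / denom
    return real + 1j * imag


def upit_mse(estimates, targets):
    """Utterance-level PIT with an (assumed) MSE criterion.

    estimates, targets: arrays of shape (speakers, frames, bins).
    A single speaker permutation is chosen for the whole utterance,
    namely the one with the smallest total error.
    """
    n_speakers = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_speakers)):
        # estimates[perm[i]] is paired with targets[i]
        loss = np.mean(np.abs(estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data: two random complex "spectrograms" standing in for source STFTs.
rng = np.random.default_rng(0)
shape = (2, 50, 129)  # (speakers, frames, frequency bins), hypothetical sizes
sources = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = sources.sum(axis=0)

# Ideal masks recover each source from the mixture, magnitude and phase alike.
masks = np.stack([cirm(mixture, s) for s in sources])
recovered = masks * mixture

# Present the outputs in the wrong speaker order; uPIT should undo the swap.
swapped = recovered[::-1]
loss, perm = upit_mse(swapped, sources)
```

In this toy setting the ideal masks are computed from the known sources, so `perm` simply identifies the swapped output order. In a trained system the masks would instead come from the network, and the minimum-loss permutation would determine which output is compared against which reference speaker during training.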

Preserving Early Reflections to Improve Speech Quality of Reverberant Speech Separation

Automatic Auditory Streaming Restores Missing Temporal Modulations in Echoic Speech

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

Early Reflections Based Speech Enhancement

On phase recovery and preserving early reflections for deep-learning speech dereverberation

ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning

Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Distortion-controlled Training for End-to-end Reverberant Speech Separation with Auxiliary Autoencoding Loss

A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations

Frequency-domain Dereverberation on Speech Signal Using Surround Retinex

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

Speech separation based on reliable binaural cues with two-stage neural network in noisy-reverberant environments

Incorporating Phase-Encoded Spectrum Masking into Speaker-Independent Monaural Source Separation

A Multichannel Learning-Based Approach for Sound Source Separation in Reverberant Environments

Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

Audio-visual multi-channel speech separation, dereverberation and recognition

Using Iterative Adaptation and Dynamic Mask for Child Speech Extraction under Real-World Multilingual Conditions

A comprehensive study of speech separation: spectrogram vs waveform separation