Abstract: Speech separation systems typically operate in the time-frequency (T-F) domain, enhancing the magnitude response and leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading many researchers to consider enhancing both the magnitude and phase spectra. The complex-valued Fourier spectrum and the real-valued shifted real spectrum (SRS) of a discrete-time speech signal both carry magnitude and phase information, from which the speech signal can be reconstructed without loss. In this paper, we propose two novel methods, pcIRM and pSRSM, to solve the problem of speaker-independent monaural source separation and to recover the phase of the source speech signals. Specifically, the pcIRM estimates the complex ideal ratio mask (cIRM) from the complex-valued phase-encoded Fourier spectrogram, and the pSRSM estimates the shifted real spectrum mask (SRSM) from the real-valued phase-encoded shifted real spectrum; both are implemented with deep bidirectional Long Short-Term Memory (LSTM) networks. Furthermore, combining these mask estimation methods with the utterance-level permutation invariant training (uPIT) approach proves to be an effective way to improve speech separation. The uPIT approach addresses the label ambiguity problem, which is the major barrier for speaker-independent multi-talker source separation. Finally, we report separation results for the proposed methods on the WSJ0-2mix dataset and compare them with those of state-of-the-art methods. Extensive experimental results demonstrate the advantages of the proposed pcIRM method in terms of the signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) metrics, and the pSRSM method obtains performance comparable to the pcIRM when separating speakers of opposite gender, with smaller model complexity.
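To make two of the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used definition of the cIRM as the complex ratio M = S / Y between a source bin S and a mixture bin Y, and the usual uPIT formulation in which one speaker-to-output assignment is chosen for the whole utterance by minimizing the loss over all permutations. The phase encoding used by pcIRM and the SRS transform are not specified in the abstract and are not reproduced here. The function names (`cirm`, `utterance_mse`, `upit_loss`), the `eps` constant, and the toy T-F bins are illustrative assumptions, and plain Python is used only to keep the sketch dependency-free.

```python
import itertools


def cirm(source_bin, mixture_bin, eps=1e-8):
    """Complex ideal ratio mask for one T-F bin, M = S / Y.

    Written out in real and imaginary parts; eps (an assumption of this
    sketch) guards against division by zero in silent mixture bins.
    """
    yr, yi = mixture_bin.real, mixture_bin.imag
    sr, si = source_bin.real, source_bin.imag
    denom = yr * yr + yi * yi + eps
    real = (yr * sr + yi * si) / denom
    imag = (yr * si - yi * sr) / denom
    return complex(real, imag)


def utterance_mse(estimate, reference):
    """Mean squared error over all T-F bins of one utterance."""
    total = sum(abs(e - r) ** 2 for e, r in zip(estimate, reference))
    return total / len(reference)


def upit_loss(estimates, references):
    """Utterance-level permutation invariant loss.

    Tries every assignment of network outputs to reference speakers and
    keeps the one with the lowest utterance-level loss.
    Returns (best_loss, best_permutation), where best_permutation[i] is
    the reference index assigned to estimates[i].
    """
    n = len(references)
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n)):
        loss = sum(
            utterance_mse(estimates[i], references[perm[i]]) for i in range(n)
        ) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: two "speakers", three complex T-F bins each (hypothetical values).
speaker_a = [1 + 2j, -0.5 + 0.3j, 2 - 1j]
speaker_b = [0.5 - 1j, 1 + 1j, -1 + 0.5j]
mixture = [a + b for a, b in zip(speaker_a, speaker_b)]

# Applying the ideal complex mask to the mixture recovers magnitude and phase.
recovered_a = [cirm(s, y) * y for s, y in zip(speaker_a, mixture)]

# Outputs arrive in swapped order; uPIT resolves the label ambiguity.
loss, perm = upit_loss([speaker_b, speaker_a], [speaker_a, speaker_b])
print(recovered_a, loss, perm)
```

Because the mask is complex, multiplying it with the mixture corrects the phase as well as the magnitude, which a real-valued magnitude mask cannot do. The exhaustive search over permutations grows factorially with the number of speakers, which is acceptable for the two-speaker setting of WSJ0-2mix.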

PGSS: Pitch-Guided Speech Separation

Single-channel speech separation integrating pitch information based on a multi task learning framework

Speaker and Direction Inferred Dual-channel Speech Separation

Speech Separation based on Contrastive Learning and Deep Modularization

Incorporating Phase-Encoded Spectrum Masking into Speaker-Independent Monaural Source Separation

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

Speaker Separation Using Speaker Inventories and Estimated Speech

SPGM: Prioritizing Local Features for enhanced speech separation performance

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

A Multi-channel Speech Separation System for Unknown Number of Multiple Speakers

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

Two-Microphones Speech Separation Using Generalized Gaussian Mixture Model

Dual-Channel Speech Separation by Sub-Segmental Directional Statistics

Utterance-level Permutation Invariant Training with Discriminative Learning for Single Channel Speech Separation

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

A Neural State-Space Modeling Approach to Efficient Speech Separation

Mixture to Mixture: Leveraging Close-talk Mixtures as Weak-supervision for Speech Separation

Cracking the cocktail party problem by multi-beam deep attractor network