Abstract: Typical speech separation systems operate in the time-frequency (T-F) domain, enhancing the magnitude response and leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading many researchers to consider enhancing both the magnitude and phase spectra. The complex-valued Fourier spectrum and the real-valued shifted real spectrum (SRS) of a discrete-time speech signal carry information in both their magnitude and phase functions, and either can reconstruct the speech signal without loss. In this paper, we propose two novel methods, pcIRM and pSRSM, to solve the problem of speaker-independent monaural source separation and to recover the phase of the source speech signal. Specifically, pcIRM estimates the complex ideal ratio mask (cIRM) from the complex-valued phase-encoded Fourier spectrogram, and pSRSM estimates the shifted real spectrum mask (SRSM) from the real-valued phase-encoded shifted real spectrum; both are implemented with deep bidirectional Long Short-Term Memory networks. Furthermore, combining these mask estimation methods with utterance-level permutation invariant training (uPIT) proves to be an effective way to improve speech separation, since uPIT addresses the label ambiguity problem, which is the major barrier to speaker-independent multi-talker source separation. Finally, we report separation results for the proposed methods on the WSJ0-2mix dataset and compare them with those of state-of-the-art methods. Extensive experimental results demonstrate the advantages of the proposed pcIRM method in terms of the signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) metrics, and the pSRSM method obtains performance comparable to pcIRM for opposite-gender speaker separation with smaller model complexity.
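
To make the two building blocks named in the abstract concrete, the following is a minimal NumPy sketch of the standard complex ideal ratio mask and of an utterance-level permutation-invariant loss. It is an illustration under stated assumptions, not the paper's implementation: the phase encoding, the SRSM, and the BLSTM model are not reproduced here, and the function names `cirm` and `upit_mse`, the `eps` regularizer, and the mean-squared-error criterion are all choices made for this example.

```python
import itertools
import numpy as np


def cirm(source_spec, mixture_spec, eps=1e-8):
    """Standard complex ideal ratio mask M, chosen so that M * mixture ~= source.

    Both inputs are complex STFT arrays of the same shape.
    eps is an assumed regularizer that avoids division by zero.
    """
    return source_spec * np.conj(mixture_spec) / (np.abs(mixture_spec) ** 2 + eps)


def upit_mse(estimates, targets):
    """Utterance-level permutation-invariant MSE (assumed loss for this sketch).

    estimates, targets: arrays of shape (n_sources, frames, bins).
    The error is averaged over the whole utterance for every assignment of
    output streams to reference sources, and the smallest one is returned
    together with the permutation that produced it.
    """
    n_sources = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_sources)):
        # estimates[perm[i]] is compared against targets[i]
        loss = np.mean(np.abs(estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the permutation is chosen once per utterance rather than once per frame, each output stream stays tied to a single speaker for the whole utterance, which is how utterance-level training sidesteps the label ambiguity mentioned in the abstract.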

Pseudo-pitch-synchronized Phase Information Extraction and Its Application for Robust Speaker Recognition

Pitch envelope based frame level score reweighed algorithm for emotion robust speaker recognition

Incorporating Phase-Encoded Spectrum Masking into Speaker-Independent Monaural Source Separation

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Speech Enhancement Based On Analysis Synthesis Framework With Improved Pitch Estimation And Spectral Envelope Enhancement

Speech Personality Recognition Based on Annotation Classification Using Log-Likelihood Distance and Extraction of Essential Audio Features

Robust Multipitch Estimation Of Piano Sounds Using Deep Spiking Neural Networks

An Investigation of the Effectiveness of Phase for Audio Classification

Significance of relative phase features for shouted and normal speech classification

Single-channel speech separation integrating pitch information based on a multi task learning framework

Auditory Model Based Speech Feature Extraction and Its Application to Speaker Identification

Multi-Pitch Detection for Co-Channel Speech Utilizing Frequency Channel Piecewise Integration and Morphological Feedback Verification Tracking

Speaker Verification Using Simple Temporal Features and Pitch Synchronous Cepstral Coefficients

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Auditory model-based speech feature extraction and its application to speaker identification

A New Robust Pitch Determination Algorithm for Telephone Speech

Noise Robust Voice Activity Detection Using Joint Phase and Magnitude Based Feature Enhancement

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

Speech Enhancement Based on Analysis–Synthesis Framework with Improved Parameter Domain Enhancement

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Speaker Identification Using MFCC Feature Extraction ANN Classification Technique