Abstract: Single-channel speech separation is useful in many applications. Time-frequency (T-F) masking is an effective method for single-channel speech separation. With advancements in deep learning, T-F masks are now widely used as training targets, achieving notable separation results. Among the numerous masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies depending on the speech mixture and the separation model. Existing approaches mainly use a single network to approximate the mask of the target speech. However, because of pauses and variations in the speakers' intonation and emphasis, mixed speech contains segments where speech is mixed with other speech, segments where speech is mixed with silent intervals, and segments where speech is mixed at a high signal-to-noise ratio (SNR). In this paper, we use different networks to handle speech segments containing these different kinds of mixtures. In addition to the existing network, we introduce a second network (using the Rectified Linear Unit as its activation function) that specifically addresses segments containing a mixture of speech and silence, as well as segments with high-SNR speech mixtures. We conducted evaluation experiments on two-speaker speech separation using the four aforementioned masks as training targets. The performance improvements observed in these experiments demonstrate that the proposed joint-network method is more effective than the conventional single-network method.
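As a rough illustration of the four training targets named above, the sketch below computes each mask for a single T-F unit from the magnitudes of the target, the interfering speech, and the mixture. It uses definitions commonly found in the masking literature; the abstract does not state the exact formulas used in the paper, so the function name `tf_masks`, the local criterion `lc_db`, and the constant `EPS` are assumptions made for this example only.

```python
import math

EPS = 1e-8  # assumed small constant; avoids division by zero in silent units


def tf_masks(target_mag, interferer_mag, mixture_mag, lc_db=0.0):
    """Return common IBM, IRM, WF and SMM values for one T-F unit.

    Hypothetical helper: the formulas are textbook definitions, not
    necessarily the ones used in the paper.
    """
    s2 = target_mag ** 2       # target power
    n2 = interferer_mag ** 2   # interferer power

    # IBM: 1 when the local SNR exceeds the local criterion, else 0.
    local_snr_db = 10.0 * math.log10((s2 + EPS) / (n2 + EPS))
    ibm = 1.0 if local_snr_db > lc_db else 0.0

    # WF: ratio of target power to total power.
    wf = s2 / (s2 + n2 + EPS)

    # IRM: square root of the same power ratio.
    irm = math.sqrt(wf)

    # SMM: target magnitude over mixture magnitude (may exceed 1).
    smm = target_mag / (mixture_mag + EPS)

    return {"IBM": ibm, "IRM": irm, "WF": wf, "SMM": smm}


# A unit where the interferer is stronger than the target.
overlap = tf_masks(target_mag=3.0, interferer_mag=4.0, mixture_mag=5.0)

# A unit where the target is mixed with silence.
with_silence = tf_masks(target_mag=1.0, interferer_mag=0.0, mixture_mag=1.0)
```

In the second call the interferer is silent, so all four masks approach 1. This is the kind of segment (speech mixed with silence) that the abstract says the additional network is meant to handle.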

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets