Abstract: Single-channel speech separation is useful in many applications, and time-frequency (T-F) masking is an effective method for it. With advances in deep learning, T-F masks are now widely used as training targets and achieve notable separation results. Among the many masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies with the speech mixture and the separation model. Existing approaches mainly use a single network to approximate the mask of the target speech. However, because of pauses and variations in the speakers' intonation and emphasis, mixed speech contains segments where speech overlaps other speech, segments where speech overlaps silent intervals, and segments where the mixture has a high signal-to-noise ratio (SNR). In this paper, we use different networks to handle speech segments containing these different kinds of mixture. In addition to the existing network, we introduce a second network, which uses the rectified linear unit (ReLU) as its activation function, specifically to address segments that mix speech with silence and segments with high-SNR speech mixtures. We conducted evaluation experiments on two-speaker speech separation using the four aforementioned masks as training targets. The performance improvements observed in these experiments demonstrate the effectiveness of the proposed joint-network method compared with the conventional single-network method.
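To make the four training targets concrete, the following is a minimal sketch of how they are commonly computed from the short-time Fourier transforms (STFTs) of the target and the interfering speech. The formulas are the standard textbook definitions and may differ in detail from those used in the paper. The function name `tf_masks`, the 0 dB local criterion `lc_db`, the stabilizer `eps`, and the SMM clipping bound `smm_max` are assumptions made for illustration.

```python
import numpy as np


def tf_masks(target_stft, interferer_stft, lc_db=0.0, smm_max=10.0, eps=1e-8):
    """Compute IBM, IRM, WF and SMM for complex STFTs of equal shape.

    Assumed (not taken from the paper): the mixture is the sum of the
    two STFTs, the IBM threshold is lc_db, and the SMM is clipped to
    [0, smm_max] to keep the target bounded.
    """
    s2 = np.abs(target_stft) ** 2        # target power per T-F unit
    n2 = np.abs(interferer_stft) ** 2    # interferer power per T-F unit
    mixture_mag = np.abs(target_stft + interferer_stft)

    # IBM: 1 where the local SNR exceeds the criterion, else 0.
    local_snr_db = 10.0 * np.log10((s2 + eps) / (n2 + eps))
    ibm = (local_snr_db > lc_db).astype(np.float64)

    # WF: ratio of target power to total power.
    wf = s2 / (s2 + n2 + eps)

    # IRM: square root of the same power ratio.
    irm = np.sqrt(wf)

    # SMM: target magnitude over mixture magnitude (can exceed 1).
    smm = np.clip(np.abs(target_stft) / (mixture_mag + eps), 0.0, smm_max)

    return {"ibm": ibm, "irm": irm, "wf": wf, "smm": smm}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 5)  # (frequency bins, frames), toy size
    s = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    n = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    for name, mask in tf_masks(s, n).items():
        print(name, mask.shape, float(mask.min()), float(mask.max()))
```

In a masking-based system, a network is trained to predict one of these masks from features of the mixture, and the predicted mask is multiplied with the mixture spectrogram to estimate the target. The IBM, IRM, and WF are bounded in [0, 1], whereas the SMM is unbounded above, which is why a clipping bound is assumed here.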

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets
