Abstract: Single-channel speech separation is useful in many applications, and time-frequency (T-F) masking is an effective way to perform it. With advances in deep learning, T-F masks are now widely used as training targets and achieve notable separation results. Among the many masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies with the speech mixture and the separation model. Existing approaches mainly use a single network to approximate the mask of the target speech. However, because of pauses and variations in the speakers' intonation and emphasis, mixed speech contains segments where speech overlaps with other speech, segments where speech coincides with silent intervals, and segments where the speech mixture has a high signal-to-noise ratio (SNR). In this paper, we use different networks to handle speech segments containing these different kinds of mixtures. In addition to the existing network, we introduce an additional network, which uses rectified linear units (ReLU) as activation functions, specifically to address segments containing a mixture of speech and silence and segments with high-SNR speech mixtures. We conducted evaluation experiments on two-speaker speech separation using the four aforementioned masks as training targets. The performance improvements observed in these experiments demonstrate the effectiveness of the proposed joint-network method compared with the conventional single-network method.
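For readers unfamiliar with the four training targets named above, the following is a minimal sketch of how they are commonly defined in the masking literature. The abstract does not give the paper's own formulas, so these textbook definitions, the function name `ideal_masks`, the stabilizer `EPS`, and the local-criterion parameter `lc_db` are all assumptions made for illustration. Inputs are magnitude spectrograms of the target, the interfering speech, and the mixture.

```python
import numpy as np

EPS = 1e-8  # hypothetical stabilizer to avoid division by zero


def ideal_masks(target_mag, interferer_mag, mixture_mag, lc_db=0.0):
    """Compute common T-F masks from magnitude spectrograms.

    These are widely used textbook definitions, assumed here for
    illustration; the paper's exact formulas may differ.
    """
    s2 = target_mag ** 2      # target power per T-F unit
    n2 = interferer_mag ** 2  # interferer power per T-F unit

    # IBM: 1 where the local SNR exceeds the local criterion (in dB).
    local_snr_db = 10.0 * np.log10((s2 + EPS) / (n2 + EPS))
    ibm = (local_snr_db > lc_db).astype(np.float64)

    # WF: ratio of target power to total power.
    wf = s2 / (s2 + n2 + EPS)

    # IRM: square root of the power ratio.
    irm = np.sqrt(wf)

    # SMM: target magnitude over mixture magnitude (not bounded by 1).
    smm = target_mag / (mixture_mag + EPS)

    return {"IBM": ibm, "IRM": irm, "WF": wf, "SMM": smm}


# Toy 2x2 example; the mixture magnitude is taken as the sum of the
# two source magnitudes purely for simplicity (in-phase assumption).
target = np.array([[3.0, 0.0], [1.0, 2.0]])
interferer = np.array([[4.0, 1.0], [1.0, 0.0]])
masks = ideal_masks(target, interferer, target + interferer)
```

In the toy example, the bottom-right unit contains only target energy, so every mask is approximately 1 there, while the top-right unit contains only interferer energy and every mask is 0. Units like these correspond to the speech-with-silence and high-SNR segments that the abstract singles out.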

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

A Binaural Deep Neural Networks Parameter Mask for the Robust Automatic Speech Recognition System

Robust Front-End for Speech Recognition Based on Computational Auditory Scene Analysis and Speaker Model

Spectral-change Enhancement with Prior SNR for the Hearing Impaired

A Dual-Microphone Speech Enhancement Algorithm for Close-Talk System

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Assessing Level-Dependent Segmental Contribution to the Intelligibility of Speech Processed by Single-Channel Noise-Suppression Algorithms