Abstract: We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs), which directly models the highly nonlinear relationship between the speech features of a target speaker and those of a mixed signal containing that speaker and interfering speakers. We focus on a semisupervised mode, in which the speech of the target speaker is separated from that of an unknown interfering speaker; this mode is more flexible than the conventional supervised mode, which requires known information about both the target and the interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, covering the features of both the target and the interfering speakers, which is shown to generalize better than an architecture that outputs only the target speaker's features. Second, we propose using a set of multiple DNNs, each intended to be signal-noise-dependent (SND), because a single general DNN cannot adequately accommodate all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that the proposed framework achieves better separation results than other conventional approaches in either the supervised or the semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low-SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are applied to the enhanced signals, a further 12.1% WER reduction over our best speaker-independent ASR system is achieved.
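The dual-output idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the network, the features, or the training objective, so everything here is an assumption for illustration only: the network output is taken to be the target-speaker feature estimate concatenated with the interferer feature estimate, and a plain mean squared error is computed over both halves. The function names `split_dual_output` and `dual_output_mse` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_dual_output(output: Sequence[float], feat_dim: int) -> Tuple[List[float], List[float]]:
    """Split one concatenated output vector into (target, interferer) estimates.

    Assumption: the first feat_dim values belong to the target speaker and
    the remaining feat_dim values to the interfering speaker.
    """
    if len(output) != 2 * feat_dim:
        raise ValueError("expected an output of length 2 * feat_dim")
    return list(output[:feat_dim]), list(output[feat_dim:])


def dual_output_mse(
    output: Sequence[float],
    target_ref: Sequence[float],
    interferer_ref: Sequence[float],
) -> float:
    """Mean squared error over both halves of a dual-output prediction.

    Assumption: both speakers are weighted equally; the paper's actual
    objective may differ.
    """
    feat_dim = len(target_ref)
    if len(interferer_ref) != feat_dim:
        raise ValueError("target and interferer references must have equal length")
    target_est, interferer_est = split_dual_output(output, feat_dim)
    sq_err = sum((e - r) ** 2 for e, r in zip(target_est, target_ref))
    sq_err += sum((e - r) ** 2 for e, r in zip(interferer_est, interferer_ref))
    return sq_err / (2 * feat_dim)


if __name__ == "__main__":
    # Toy 2-dimensional features: one interferer coefficient is off by 1.
    print(dual_output_mse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0], [3.0, 5.0]))
```

A single-output variant would drop the second sum and penalize only the target estimate. In the sketch, the dual form differs only in that the interferer reference also contributes to the error.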

Speaker Verification based on Single Channel Speech Separation

Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Single-Channel Multi-Speaker Separation using Deep Clustering

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Experiments on Blind Speech Separations

Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion

Self-attention Based Speaker Recognition Using Cluster-Range Loss

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Two-stage Model and Optimal SI-SNR for Monaural Multi-Speaker Speech Separation in Noisy Environment

Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order

Speaker Verification using Convolutional Neural Networks

Mixture Encoder for Joint Speech Separation and Recognition

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

Investigation of Bottleneck Features and Multilingual Deep Neural Networks for Speaker Verification

Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for CHiME-5 Challenge

Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification