Abstract: Speech separation refers to extracting each individual speech source from a given mixed signal. Recent advances and ongoing research in this area have made these approaches promising techniques for pre-processing naturalistic audio streams. Since deep learning techniques were incorporated into speech separation, the performance of these systems has improved more rapidly. The initial deep learning based solutions transformed the speech signals into the time-frequency domain with the STFT, and the encoded mixed signals were then fed into a deep neural network based separator. More recently, new methods have been introduced that separate the waveform of the mixed signal directly, without STFT analysis. Here, we introduce a unified framework that brings spectrogram and waveform separation into a single structure, differing only in the kernel function used to encode and decode the data; both variants can achieve competitive performance. This framework provides flexibility: depending on the characteristics of the data, or on memory and latency constraints, the hyper-parameters can be set to select a pipeline within the framework that fits the task. We extend single-channel speech separation to a multi-channel framework with end-to-end training of the network, directly optimizing the speech separation criterion (i.e., Si-SNR). We emphasize how tying the kernel functions used for calculating spatial features, the encoder, and the decoder can be effective in the multi-channel framework. We simulate spatialized reverberant data for both the WSJ0 and LibriSpeech corpora; because these two data sets differ in size and duration, we study in detail the effect of capturing shorter and longer dependencies on previous and future samples. We report SDR, Si-SNR, and PESQ to evaluate the performance of the developed solutions.
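
For readers unfamiliar with the Si-SNR criterion named in the abstract, the following is a minimal sketch of the commonly used scale-invariant SNR definition, written in plain Python. It is an illustration only, not the authors' implementation: the function name `si_snr`, the `eps` stabilizer, and the zero-mean normalization step are assumptions, and the abstract does not specify which exact variant is optimized.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB for two equal-length 1-D signals.

    Hypothetical helper for illustration; `eps` is an assumed
    numerical stabilizer, not a value taken from the paper.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean of each signal (a common convention, assumed here).
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the scaled target part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever remains after the projection is treated as noise.
    e_noise = [a - b for a, b in zip(e, s_target)]

    num = sum(x * x for x in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why this measure is often preferred over plain SNR as a training objective. In training, the negative of this value would typically be minimized.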

An Online Speaker-aware Speech Separation Approach Based on Time-domain Representation

Source-Aware Context Network for Single-Channel Multi-Speaker Speech Separation.

Listening and Grouping: an Online Autoregressive Approach for Monaural Speech Separation

A Unified Framework for Speech Separation

Speaker and Direction Inferred Dual-channel Speech Separation

Localization Based Stereo Speech Source Separation Using Probabilistic Time-Frequency Masking and Deep Neural Networks

Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios

SuperFormer: Enhanced Multi-Speaker Speech Separation Network Combining Channel and Spatial Adaptability

Localization Based Stereo Speech Separation Using Deep Networks.

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments.

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Lightweight Target Speaker Separation Network Based on Joint Training

A Deep Ensemble Learning Method for Monaural Speech Separation.

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments

Low-Latency Deep Clustering For Speech Separation

Time domain audio visual speech separation

Speaker-aware target speaker enhancement by jointly learning with speaker embedding extraction

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition