Abstract

Purpose: Current mainstream methods for single-channel speech separation generally rely on a feature extraction step such as the short-time Fourier transform and on long input sequences. As a result, they do not fully exploit the information in the speech features, and they introduce delay into the separation.

Methods: To achieve better performance with a lightweight model, a fully convolutional end-to-end audio separation network is proposed based on the features of two domains, i.e. the temporal domain and the channel domain. It considers not only the temporal correlation of speech signals but also the correlation between channels in the signal feature map. First, the end-to-end network samples and encodes the speech waveform using a convolution over non-overlapping segments. It then calculates the mask by convolving the encoded feature space along both the time-series and inter-channel dimensions. Finally, it decodes the masked feature space to reconstruct the waveform.

Results: The proposed end-to-end speech separation method makes fuller use of the feature-space information of speech signals. In addition, the separation module introduces a residual structure and dilated convolution, which improve separation accuracy and computational speed with fewer parameters. Experiments show that, compared with the baseline Conv-TasNet, the proposed model improves the SI-SNR (scale-invariant source-to-noise ratio) metric by 3.1 dB on the WSJ0-Mix2 dataset.

Conclusion: This paper proposes an improved speech separation algorithm. It separates speech better than Conv-TasNet while inheriting Conv-TasNet's lightweight property. In the task of separating speech signals mixed at a random signal-to-noise ratio (SNR) between −5 and 5 dB, the proposed algorithm achieves relatively high accuracy.
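The Results are reported in SI-SNR, which the abstract names but does not define. As a minimal sketch, the following assumes the definition commonly used with Conv-TasNet: both signals are made zero-mean, the estimate is projected onto the target, and the ratio of projected energy to residual energy is expressed in dB. The function name `si_snr` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Assumed definition (not stated in the abstract):
      s_target = (<est, tgt> / ||tgt||^2) * tgt
      e_noise  = est - s_target
      SI-SNR   = 10 * log10(||s_target||^2 / ||e_noise||^2)
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target direction.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Everything in the estimate that is not explained by the target.
    e_noise = [e - s for e, s in zip(est, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(e * e for e in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


if __name__ == "__main__":
    # Synthetic signals for illustration only; these are not data from the paper.
    target = [math.sin(0.1 * i) for i in range(200)]
    estimate = [t + 0.1 * math.cos(0.37 * i) for i, t in enumerate(target)]
    print(f"SI-SNR: {si_snr(estimate, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, so the metric rewards waveform shape rather than output gain. An improvement such as the 3.1 dB reported above would then be the difference between two such scores on the same test material.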

Single-channel Speech Separation with Non-Negative Matrix Factorization and Factorial Conditional Random Fields

Transductive Nonnegative Matrix Factorization for Semi-Supervised High-Performance Speech Separation

The Source Separation of Multi-channel Vibration Signal Based on Nonnegative Tensor Factorization

An MRF-ICA based algorithm for image separation

Deep Learning Based Speech Separation Via NMF-Style Reconstructions

Adaptive Beamforming Based on Interference-Plus-Noise Covariance Matrix Reconstruction for Speech Separation

Deep NMF for speech separation

Experiments on Blind Speech Separations

Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

Multichannel blind speech source separation with a disjoint constraint source model

Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations

Determined Multichannel Blind Source Separation with Clustered Source Model

Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain

Multi-channel Multi-frame ADL-MVDR for Target Speech Separation

Adaptive Speech Separation Based on Beamforming and Frequency Domain-Independent Component Analysis

Separation and Extraction of Compound-Fault Signal Based on Multi-Constraint Non-Negative Matrix Factorization

Deep Factorization for Speech Signal

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

An End-to-End Speech Separation Method Based on Features of Two Domains