Abstract:

Purpose: The current mainstream methods for single-channel speech separation generally rely on a feature extraction step such as the short-time Fourier transform and on long input sequences. As a result, they do not fully exploit the information in the speech features and they introduce signal delay during separation.

Methods: To achieve better performance with a lightweight model, a fully convolutional end-to-end audio separation network is proposed based on the features of two domains, i.e. the temporal domain and the channel domain. It considers not only the temporal correlation of speech signals, but also the correlation between channels in the signal feature map. First, the end-to-end network uses a convolution with non-overlapping segments to sample and encode the speech waveform. Next, it calculates the mask by convolving the encoded feature space along both the time-series and inter-channel dimensions. Finally, it decodes the masked feature space to reconstruct the waveform.

Results: The proposed end-to-end speech separation method makes full use of the feature space information of speech signals. In addition, the separation module introduces a residual structure and dilated convolution, which improves separation accuracy and computational speed with fewer parameters. The experiments show that, compared with the baseline Conv-TasNet, the proposed model improves the SI-SNR (scale-invariant source-to-noise ratio) metric by 3.1 dB on the WSJ0-Mix2 dataset.

Conclusion: This paper proposes an improved speech separation algorithm. Compared with Conv-TasNet, it improves speech separation performance while inheriting the lightweight property of Conv-TasNet. In the task of separating speech signals mixed at a random signal-to-noise ratio (SNR) between −5 and 5 dB, the proposed algorithm achieves relatively high accuracy.
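
For readers unfamiliar with the SI-SNR metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed, assuming the usual zero-mean definition: the estimate is projected onto the reference, and the ratio of the projected energy to the residual energy is reported in dB. The function name `si_snr` and the `eps` stabilizer are illustrative choices, not taken from the paper, and the paper's own evaluation code may differ in detail.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Illustrative sketch only; `eps` is an assumed numerical stabilizer.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean of each signal so that DC offsets do not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: s_target = (<est, tgt> / ||tgt||^2) * tgt.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever remains after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    target_power = sum(s * s for s in s_target)
    noise_power = sum(e * e for e in e_noise)
    return 10.0 * math.log10((target_power + eps) / (noise_power + eps))
```

Because the reference is rescaled to best match the estimate before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what makes the metric scale-invariant.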

Learning a hierarchical dictionary for single-channel speech separation

Learning a Discriminative Dictionary for Single-Channel Speech Separation

Single-channel Speech Separation Using Sequential Discriminative Dictionary Learning

Single-channel Speech Separation Using Dictionary-updated Orthogonal Matching Pursuit and Temporal Structure Information

Dual Transform Based Joint Learning Single Channel Speech Separation Using Generative Joint Dictionary Learning

DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection

Single channel blind source separation of vibration signals based on improved dictionary learning

Localization Based Stereo Speech Separation Using Deep Networks

Parallel And Hierarchical Decision Making For Sparse Coding In Speech Recognition

Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-Order Latent Domain

Discriminative structured dictionary learning with hierarchical group sparsity

Monaural Speech Enhancement Using Joint Dictionary Learning with Cross-Coherence Penalties

Underdetermined Blind Source Separation of Speech Mixtures Unifying Dictionary Learning and Sparse Representation

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Source-Aware Context Network for Single-Channel Multi-Speaker Speech Separation

An End-to-End Speech Separation Method Based on Features of Two Domains

So-DAS: A Two-Step Soft-Direction-Aware Speech Separation Framework

Single Channel Source Separation Using Filterbank and 2D Sparse Matrix Factorization

Supervised Single-Channel Speech Enhancement Using Ratio Mask with Joint Dictionary Learning