Abstract:Most speech separation studies in monaural channel use only a single type of network, and the separation effect is typically not satisfactory, posing difficulties for high quality speech separation. In this study, we propose a convolutional recurrent neural network with an attention (CRNN-A) framework for speech separation, fusing advantages of two networks together. The proposed separation framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that a sole RNN cannot effectively learn the necessary features. This framework makes use of the translation invariance provided by CNN to extract information without modifying the original signals. Within the supplemented CNN, two different convolution kernels are designed to capture information in both the time and frequency domains of the input spectrogram. After concatenating the time-domain and the frequency-domain feature maps, the feature information of speech is exploited through consecutive convolutional layers. Finally, the feature map learned from the front-end CNN is combined with the original spectrogram and is sent to the back-end RNN. Further, the attention mechanism is further incorporated, focusing on the relationship among different feature maps. The effectiveness of the proposed method is evaluated on the standard dataset MIR-1K and the results prove that the proposed method outperforms the baseline RNN and other popular speech separation methods, in terms of GNSDR (gloabl normalised source-to-distortion ratio), GSIR (global source-to-interferences ratio), and GSAR (gloabl source-to-artifacts ratio). In summary, the proposed CRNN-A framework can effectively combine the advantages of CNN and RNN, and further optimise the separation performance via the attention mechanism. The proposed framework can shed a new light on speech separation, speech enhancement, and other related fields.

Glmsnet: single channel speech separation framework in noisy and reverberant environments

Glottal Information Based Spectral Recuperation in Multi-channel Speaker Recognition

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

A Multichannel Learning-Based Approach for Sound Source Separation in Reverberant Environments

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Multi-branch Learning for Noisy and Reverberant Monaural Speech Separation

Audio-visual multi-channel speech separation, dereverberation and recognition

CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

A comprehensive study of speech separation: spectrogram vs waveform separation

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations.

Spatial-temporal Graph Based Multi-channel Speaker Verification With Ad-hoc Microphone Arrays

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Generative adversarial networks for single channel separation of convolutive mixed speech signals

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation