Abstract: Most monaural speech separation studies use only a single type of network, and the separation performance is often unsatisfactory, which makes high-quality speech separation difficult. In this study, we propose a convolutional recurrent neural network with attention (CRNN-A) framework for speech separation that fuses the advantages of two networks. The proposed framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that an RNN alone cannot effectively learn the necessary features. The framework exploits the translation invariance provided by the CNN to extract information without modifying the original signals. Within the added CNN, two different convolution kernels are designed to capture information along the time and frequency axes of the input spectrogram. After the time-domain and frequency-domain feature maps are concatenated, the speech feature information is further exploited through consecutive convolutional layers. Finally, the feature map learned by the front-end CNN is combined with the original spectrogram and sent to the back-end RNN. In addition, an attention mechanism is incorporated that focuses on the relationships among different feature maps. The effectiveness of the proposed method is evaluated on the standard MIR-1K dataset, and the results show that it outperforms the baseline RNN and other popular speech separation methods in terms of GNSDR (global normalised source-to-distortion ratio), GSIR (global source-to-interference ratio), and GSAR (global source-to-artifacts ratio). In summary, the proposed CRNN-A framework effectively combines the advantages of CNN and RNN and further optimises separation performance via the attention mechanism. The proposed framework can shed new light on speech separation, speech enhancement, and other related fields.
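The data flow of the front-end described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the kernel shapes (1x3 along time, 3x1 along frequency), the fixed kernel values, the zero padding, and the names `conv2d_same` and `front_end` are all assumptions made for illustration. The consecutive convolutional layers, the back-end RNN, and the attention mechanism are omitted, and a real model would learn its kernels.

```python
# Hypothetical sketch of the two-kernel CNN front-end data flow.
# Kernel sizes and values are illustrative assumptions, not taken from the paper.

def conv2d_same(spec, kernel):
    """Zero-padded 2D cross-correlation; the output has the same shape as spec.

    spec is a list of rows indexed as spec[frequency_bin][time_frame].
    """
    n_freq, n_time = len(spec), len(spec[0])
    k_f, k_t = len(kernel), len(kernel[0])
    pad_f, pad_t = k_f // 2, k_t // 2
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            acc = 0.0
            for i in range(k_f):
                for j in range(k_t):
                    ff, tt = f + i - pad_f, t + j - pad_t
                    # Positions outside the spectrogram count as zero.
                    if 0 <= ff < n_freq and 0 <= tt < n_time:
                        acc += spec[ff][tt] * kernel[i][j]
            out[f][t] = acc
    return out


def front_end(spec, time_kernel, freq_kernel):
    """Return channels [time feature map, frequency feature map, original spectrogram]."""
    time_map = conv2d_same(spec, time_kernel)  # context across neighbouring frames
    freq_map = conv2d_same(spec, freq_kernel)  # context across neighbouring bins
    # Concatenate the two feature maps along a channel axis, then keep the
    # unmodified spectrogram as an extra channel for the back-end.
    return [time_map, freq_map, spec]


# Toy magnitude spectrogram: 2 frequency bins x 3 time frames.
spec = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
time_kernel = [[1.0, 1.0, 1.0]]       # 1x3: spans three time frames
freq_kernel = [[1.0], [1.0], [1.0]]   # 3x1: spans three frequency bins

features = front_end(spec, time_kernel, freq_kernel)
print(len(features), "channels of shape", len(features[0]), "x", len(features[0][0]))
```

Keeping the original spectrogram as its own channel mirrors the abstract's statement that the learned feature map is combined with the unmodified input before it reaches the RNN; how the paper performs that combination in detail is not stated in the abstract.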

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Environment-Dependent Attention-Driven Recurrent Convolutional Neural Network for Robust Speech Enhancement

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

A Nested U-Net with Efficient Channel Attention and D3Net for Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

Hybrid Dilated and Recursive Recurrent Convolution Network for Time-Domain Speech Enhancement

A Recursive Network with Dynamic Attention for Monaural Speech Enhancement

Combining Multi-Perspective Attention Mechanism With Convolutional Networks for Monaural Speech Enhancement

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition

A Fully Convolutional Neural Network for Speech Enhancement

Inplace Gated Convolutional Recurrent Neural Network For Dual-channel Speech Enhancement

2D-to-2D Mask Estimation for Speech Enhancement Based on Fully Convolutional Neural Network