Abstract: Conventional time–frequency (TF) domain source separation methods mainly focus on predicting TF masks or speech spectra, and the complex ideal ratio mask (cIRM) is an effective target for speech enhancement and separation. However, some recent studies employ a real-valued network, such as a general convolutional neural network (CNN) or a recurrent neural network (RNN), to predict a complex-valued mask or spectrogram target, which leads to unbalanced training of the real and imaginary parts. In this paper, to estimate the complex-valued target more accurately, a novel U-shaped complex network for the complex signal approximation (uCSA) method is proposed. The uCSA is an adaptive front-end time-domain separation method that tackles the monaural source separation problem in three ways. First, we design and implement a complex U-shaped network architecture comprising well-defined complex-valued encoder and decoder blocks, as well as complex-valued bidirectional Long Short-Term Memory (BLSTM) layers, to carry out complex-valued operations. Second, the cIRM is the training target of our uCSA method and is optimized by signal approximation (SA), which takes advantage of both the real and imaginary components of the complex-valued spectrum. Third, we reformulate the STFT and inverse STFT as differentiable operations, and the model is trained with the scale-invariant source-to-noise ratio (SI-SNR) loss, which enables end-to-end training of the speech source separation task. The proposed uCSA models are evaluated on the WSJ0-2mix dataset, a corpus commonly used by supervised speech separation methods. Extensive experimental results indicate that our proposed method achieves state-of-the-art performance as measured by the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics.
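The two quantities named in the abstract, the cIRM and the SI-SNR, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cirm` and `si_snr` are hypothetical, the code uses plain Python rather than a differentiable framework, and the zero-mean step and the `eps` stabilizer follow common convention rather than anything stated in the abstract. For a single TF bin with mixture Y and source S, the cIRM is the complex number M satisfying M · Y = S; the SI-SNR compares an estimated waveform to the target after projecting out any scale difference.

```python
import math


def cirm(mixture_bin: complex, source_bin: complex, eps: float = 1e-8) -> complex:
    """Complex ideal ratio mask for one TF bin: the M such that M * Y ~= S.

    Hypothetical helper for illustration; eps guards against division by zero.
    """
    yr, yi = mixture_bin.real, mixture_bin.imag
    sr, si = source_bin.real, source_bin.imag
    denom = yr * yr + yi * yi + eps
    real = (yr * sr + yi * si) / denom
    imag = (yr * si - yi * sr) / denom
    return complex(real, imag)


def si_snr(estimate, target, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between two equal-length real sequences.

    Both signals are made zero-mean first (an assumed, commonly used convention).
    """
    n = len(target)
    if len(estimate) != n or n == 0:
        raise ValueError("estimate and target must be non-empty and equal length")
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [(dot / t_energy) * b for b in t]
    # Everything left over in the estimate is treated as noise.
    noise = [a - p for a, p in zip(e, s_target)]
    num = sum(p * p for p in s_target)
    den = sum(q * q for q in noise) + eps
    return 10.0 * math.log10(num / den + eps)


if __name__ == "__main__":
    y, s = complex(3.0, 4.0), complex(1.0, -2.0)
    m = cirm(y, s)
    print("mask:", m, "masked mixture:", m * y)

    target = [0.0, 1.0, 0.0, -1.0]
    estimate = [0.1, 0.9, -0.1, -1.1]
    print("SI-SNR (dB):", si_snr(estimate, target))
```

A training loss in this setting would typically be the negative SI-SNR of the waveform obtained by applying the predicted mask to the mixture spectrum and inverting the STFT; that full pipeline is beyond this sketch.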

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

A Dual-Channel End-to-End Speech Enhancement Method Using Complex Operations in the Time Domain

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

An Attention-augmented Fully Convolutional Neural Network for Monaural Speech Enhancement

Convolutional gated recurrent unit networks based real-time monaural speech enhancement

S-DCCRN: Super Wide Band DCCRN with Learnable Complex Feature for Speech Enhancement

Monaural Speech Enhancement with Deep Residual-Dense Lattice Network and Attention Mechanism in the Time Domain

A Complex Neural Network Adaptive Beamforming for Multi-channel Speech Enhancement in Time Domain

Monaural Speech Enhancement Using Deep Multi-Branch Residual Network with 1-D Causal Dilated Convolutions

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

Coarse-Grained Attention Fusion with Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition

Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement

A Two-Stage Deep Neural Network with Bounded Complex Ideal Ratio Masking for Monaural Speech Enhancement

End-to-End Monaural Speech Separation with a Deep Complex U-Shaped Network

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

D2Former: A Fully Complex Dual-Path Dual-Decoder Conformer Network using Joint Complex Masking and Complex Spectral Mapping for Monaural Speech Enhancement

Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement