Abstract: Monaural speech enhancement has been widely studied using real-valued networks in the time-frequency (TF) domain. However, because the input and the target are naturally complex-valued in the TF domain, a fully complex network is highly desirable for effectively learning the feature representation and modelling the sequence in the complex domain. Moreover, phase, an important factor in the perceptual quality of speech, has been shown to be learnable together with magnitude from noisy speech using complex masking or complex spectral mapping. Many recent studies focus on either complex masking or complex spectral mapping alone, ignoring their performance boundaries. To address the above issues, we propose a fully complex dual-path dual-decoder conformer network (D2Former) using joint complex masking and complex spectral mapping for monaural speech enhancement. In D2Former, we extend the conformer network into the complex domain and form a dual-path complex TF self-attention architecture for effectively modelling the complex-valued TF sequence. We further boost the TF feature representation in the encoder and the decoders using a dual-path learning structure that exploits complex dilated convolutions for time dependency and complex feedforward sequential memory networks (CFSMN) for frequency recurrence. In addition, we improve the performance boundaries of complex masking and complex spectral mapping by combining the strengths of the two training targets in a joint-learning framework. As a consequence, D2Former takes full advantage of the complex-valued operations, the dual-path processing, and the joint training targets. Compared to previous models, D2Former achieves state-of-the-art results on the VoiceBank+Demand benchmark with the smallest model size of 0.87M parameters.
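
The two training targets named in the abstract can be illustrated with a minimal NumPy sketch. Complex masking multiplies the noisy TF spectrum by a complex-valued mask, while complex spectral mapping predicts the clean spectrum directly. The abstract does not state how D2Former fuses the two decoder outputs, so the weighted sum below, the weight `alpha`, and the function names are assumptions made purely for illustration. The oracle mask and oracle mapping stand in for network outputs.

```python
import numpy as np


def complex_masking(noisy_spec, mask):
    """Apply a complex mask by element-wise complex multiplication.

    A complex mask rescales the magnitude and rotates the phase of each
    TF bin, which is why it can correct phase as well as magnitude.
    """
    return noisy_spec * mask


def joint_estimate(noisy_spec, mask, mapped_spec, alpha=0.5):
    """Combine a masking estimate and a mapping estimate.

    ASSUMPTION: a convex combination with a hypothetical weight `alpha`.
    The abstract does not specify the actual fusion rule.
    """
    masked = complex_masking(noisy_spec, mask)
    return alpha * masked + (1.0 - alpha) * mapped_spec


# Toy complex spectra with shape (frequency bins, frames).
rng = np.random.default_rng(0)
shape = (4, 3)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.3 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# Oracle targets used in place of the two decoder outputs.
ideal_mask = clean / noisy  # ideal complex ratio mask
ideal_map = clean           # ideal complex spectral mapping output

estimate = joint_estimate(noisy, ideal_mask, ideal_map)
```

With oracle targets both branches reproduce the clean spectrum, so any convex combination does too. In a trained model the two estimates would differ, and the combination is where a joint framework could draw on the strengths of each.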

Dual-Branch Modeling Based on State-Space Model for Speech Enhancement

A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

ForkNet: Simultaneous Time and Time-Frequency Domain Modeling for Speech Enhancement

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Spiking Structured State Space Model for Monaural Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Real-Time Speech Enhancement for Mobile Communication Based on Dual-Channel Complex Spectral Mapping

Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models

Selective State Space Model for Monaural Speech Enhancement

D2Former: A Fully Complex Dual-Path Dual-Decoder Conformer Network using Joint Complex Masking and Complex Spectral Mapping for Monaural Speech Enhancement

A dual-region speech enhancement method based on voiceprint segmentation

Double Branches and Stages Neural Network for Joint Acoustic Echo and Noise Suppression

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

Dbn Based Multi-Stream Models For Speech

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

Speech Perception Improvement Algorithm Based on a Dual-Path Long Short-Term Memory Network

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

A Feature Integration Network for Multi-Channel Speech Enhancement

Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

A Dual Microphone Speech Enhancement Method With A Smoothing Parameter Mask