Abstract: Monaural speech enhancement aims to remove background noise from noisy speech signals captured by a single microphone. In recent years, several cross-domain monaural speech enhancement methods have been developed to leverage both waveform and harmonic information. However, these methods fall short of fully capturing the dependencies between the time domain and the time-frequency (T-F) domain, and of harnessing the benefits of the target decoupling strategy. This paper proposes a causal encoder-decoder-based Triple-branch Cross-domain Fusion Network (TCF-Net), which processes speech by leveraging both time-domain and T-F domain features. The proposed approach recovers magnitude and phase information in parallel to alleviate the compensation problem between them. TCF-Net forms a triple-branch network that collaboratively reconstructs the enhanced spectrum with a complex spectrum branch and a magnitude spectrum branch, while incorporating time-domain information through a waveform compensation branch. To fully leverage the information from the three domains, Triple-domain Fusion Modules (TFMs) are inserted in each intermediate layer of the model to extract and merge the information from the two T-F domain branches and the one time-domain branch. The TFMs generate masks that progressively compensate for the magnitude of the two T-F domain branches and promote information interaction, further restoring the magnitude of the clean speech. Experimental results demonstrate that TCF-Net outperforms state-of-the-art (SOTA) cross-domain methods and target decoupling methods under a causal configuration on all evaluation metrics, which validates the importance of the proposed cross-domain information fusion strategy and target decoupling strategy.
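The target decoupling idea mentioned in the abstract can be illustrated with a small numerical sketch. This is not TCF-Net itself: the function name `decoupled_reconstruction`, the oracle mask, and the oracle residual below are assumptions made purely for illustration, standing in for what a magnitude spectrum branch and a complex spectrum branch might output. The sketch only shows how a masked magnitude estimate and a complex-valued correction can be combined into one enhanced spectrum.

```python
import numpy as np

def decoupled_reconstruction(noisy_spec, mag_mask, complex_residual):
    """Combine a magnitude-mask estimate with a complex residual.

    noisy_spec:       complex array (freq, frames), spectrum of the noisy speech.
    mag_mask:         real array in [0, 1], same shape
                      (hypothetical magnitude-branch output).
    complex_residual: complex array, same shape
                      (hypothetical complex-branch output).
    """
    noisy_mag = np.abs(noisy_spec)
    noisy_phase = np.angle(noisy_spec)
    # Coarse estimate: masked magnitude paired with the unmodified noisy phase.
    coarse = mag_mask * noisy_mag * np.exp(1j * noisy_phase)
    # The complex residual corrects both the magnitude and the phase error.
    return coarse + complex_residual

# Toy data standing in for real spectrograms (assumption: random values).
rng = np.random.default_rng(0)
shape = (4, 3)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# Oracle targets, used here in place of trained network outputs.
mask = np.clip(np.abs(clean) / (np.abs(noisy) + 1e-8), 0.0, 1.0)
residual = clean - mask * noisy

enhanced = decoupled_reconstruction(noisy, mask, residual)
```

With oracle targets the reconstruction matches the clean spectrum up to floating-point error. In a trained system both quantities would be estimates, and the residual would have to absorb whatever phase error the magnitude mask alone cannot remove.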

Cross Domain Optimization for Speech Enhancement: Parallel or Cascade?

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Decoupling-style monaural speech enhancement with a triple-branch cross-domain fusion network

Cross-domain Single-channel Speech Enhancement Model with Bi-projection Fusion Module for Noise-robust ASR

A Novel Target Decoupling Framework Based on Waveform-Spectrum Fusion Network for Monaural Speech Enhancement

Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

CRA-DIFFUSE: Improved Cross-Domain Speech Enhancement Based on Diffusion Model with T-F Domain Pre-Denoising

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

CompNet: Complementary Network for Single-Channel Speech Enhancement

Speech Enhancement with Integration of Neural Homomorphic Synthesis and Spectral Masking

Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection

Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods

ForkNet: Simultaneous Time and Time-Frequency Domain Modeling for Speech Enhancement

DCHT: Deep Complex Hybrid Transformer for Speech Enhancement

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Coarse-to-fine Optimization for Speech Enhancement

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement