Abstract: Two-stage pipelines are popular in speech enhancement because they outperform traditional single-stage methods. Current two-stage approaches usually enhance the magnitude spectrum in the first stage, then modify the complex spectrum in the second stage to suppress residual noise and recover the speech phase. The whole process is performed in the short-time Fourier transform (STFT) spectrum domain. In this paper, we re-implement the second sub-process in the short-time discrete cosine transform (STDCT) spectrum domain, because we have found that STDCT exhibits greater noise suppression capability than STFT. Additionally, the implicit phase of STDCT makes phase recovery simpler and more efficient, whereas phase recovery is challenging and computationally expensive in STFT-based methods. Therefore, we propose a novel two-stage framework, the STFT-STDCT spectrum fusion network (FDFNet), for speech enhancement in the cross-spectrum domain. Experimental results demonstrate that the proposed FDFNet outperforms previous two-stage methods and also exhibits superior performance compared to other advanced systems.
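
The "implicit phase" claim can be illustrated with a minimal sketch: a short-time DFT yields complex coefficients whose phase must be handled explicitly, while a short-time DCT yields real coefficients whose sign carries the phase-like information. The helper names (`dct_matrix`, `frame_signal`, `stft`, `stdct`), the frame length, the hop size, and the rectangular window are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ X inverts it."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    t = np.arange(n)[None, :]  # time index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)  # rescale the DC row so that C is orthogonal
    return c


def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Cut x into overlapping frames (rectangular window, no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])


def stft(x: np.ndarray, frame_len: int = 64, hop: int = 32) -> np.ndarray:
    """Short-time DFT: complex output, phase is an explicit quantity."""
    return np.fft.rfft(frame_signal(x, frame_len, hop), axis=1)


def stdct(x: np.ndarray, frame_len: int = 64, hop: int = 32) -> np.ndarray:
    """Short-time DCT: real output, sign of each coefficient carries the phase."""
    return frame_signal(x, frame_len, hop) @ dct_matrix(frame_len).T


if __name__ == "__main__":
    x = np.sin(2 * np.pi * 5 * np.arange(256) / 256)
    print(stft(x).dtype, stdct(x).dtype)  # complex dtype vs. real dtype
```

Because the DCT matrix here is orthonormal, multiplying the coefficients by `dct_matrix(frame_len)` recovers each frame exactly, so a network operating on STDCT coefficients only needs to predict real values.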

What problem does this paper attempt to address?

This paper attempts to solve two main challenges in existing speech enhancement methods:

1. **Difficulty in explicit phase estimation**: In the short-time Fourier transform (STFT) spectral domain, accurately recovering speech phase information remains challenging. Traditional methods usually focus only on recovering the target magnitude spectrum and ignore the phase information, which limits the effectiveness of speech enhancement.
2. **Problem of residual noise suppression**: In the STFT complex spectral domain, it is very difficult to further suppress the residual noise after magnitude-spectrum enhancement. Although existing two-stage methods roughly remove noise by enhancing the magnitude spectrum in the first stage, they are not effective in further processing the complex spectrum in the second stage to suppress residual noise and recover the speech phase.

To solve these problems, the authors propose a new two-stage framework, the **STFT-STDCT Spectrum Fusion Network (FDFNet)**. The main improvements of this method are:

- **Introduction of short-time discrete cosine transform (STDCT)**: Compared with STFT, STDCT shows a stronger ability in noise suppression, and its implicit phase information makes phase recovery simpler and more efficient.
- **Cross-spectral-domain processing**: Traditional STFT spectral-domain processing is combined with STDCT spectral-domain processing to achieve more effective speech enhancement. Specifically, the first stage uses STFT to enhance the magnitude spectrum, and the second stage further optimizes in the STDCT spectral domain to better suppress residual noise and recover the speech phase.

In experiments, FDFNet outperforms existing two-stage methods and other advanced systems on multiple evaluation metrics, especially wideband perceptual speech quality (WB-PESQ), signal distortion (CSIG), background noise intrusiveness (CBAK), and overall audio quality (COVL).

In summary, this paper aims to propose a more effective real-time speech enhancement method that combines the advantages of STFT and STDCT to overcome the deficiencies of existing methods in phase recovery and residual noise suppression.
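
The two-stage data flow described above can be sketched on pre-cut frames as follows. This is only a sketch under stated assumptions: `stage1_magnitude_gain` and `stage2_dct_mask` are hypothetical hand-written gains standing in for the paper's networks, reusing the noisy phase after stage 1 is an assumption, and windowing and overlap-add are omitted.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (rows are frequencies, columns are samples)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def stage1_magnitude_gain(mag: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the stage-1 network: a gain in [0, 1]."""
    floor = np.median(mag, axis=-1, keepdims=True)  # crude noise-floor guess
    return np.clip(1.0 - floor / (mag + 1e-8), 0.0, 1.0)


def stage2_dct_mask(coeffs: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the stage-2 network: a real mask in [0, 1)."""
    power = coeffs**2
    return power / (power + np.mean(power, axis=-1, keepdims=True) + 1e-8)


def enhance_frames(frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (coarse, refined) frames for an input of shape (n_frames, n)."""
    n = frames.shape[-1]

    # Stage 1 (STFT domain): scale the magnitude only, keep the noisy phase.
    spec = np.fft.rfft(frames, axis=-1)
    mag, phase = np.abs(spec), np.angle(spec)
    coarse_spec = stage1_magnitude_gain(mag) * mag * np.exp(1j * phase)
    coarse = np.fft.irfft(coarse_spec, n=n, axis=-1)

    # Stage 2 (STDCT domain): one real mask adjusts size and keeps the sign,
    # so no separate phase estimate is needed.
    c = dct_matrix(n)
    coeffs = coarse @ c.T
    refined = (stage2_dct_mask(coeffs) * coeffs) @ c
    return coarse, refined


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((4, 64))
    coarse, refined = enhance_frames(noisy)
    print(coarse.shape, refined.shape)
```

Both placeholder gains lie between 0 and 1, so each stage can only attenuate energy; a trained network would replace them with learned, signal-dependent estimates.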

A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement

Forensic Speech Enhancement Based on Two-Dimensional Fractional Fourier Transform Domain

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Two-Stage Single-Channel Speech Enhancement with Multi-Frame Filtering

FSI-Net: A dual-stage full- and sub-band integration network for full-band speech enhancement

Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

A Novel Target Decoupling Framework Based on Waveform-Spectrum Fusion Network for Monaural Speech Enhancement

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

Supervised Single-Channel Speech Dereverberation And Denoising Using A Two-Stage Processing

Foster Strengths and Circumvent Weaknesses: a Speech Enhancement Framework with Two-branch Collaborative Learning

An End-to-End Speech Enhancement Framework Using Stacked Multi-scale Blocks

Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement

On real-time multi-stage speech enhancement systems

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Supervised Single-Channel Speech Dereverberation and Denoising Using a Two-Stage Model Based Sparse Representation

Supervised Single Channel Dual Domains Speech Enhancement Using Sparse Non-Negative Matrix Factorization

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

A Hybrid Deep-Learning Approach for Single Channel HF-SSB Speech Enhancement