A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement

Yuewei Zhang, Huanbin Zou, Jie Zhu
2024-01-19
Abstract: The two-stage pipeline is popular in speech enhancement tasks because it outperforms traditional single-stage methods. Current two-stage approaches usually enhance the magnitude spectrum in the first stage, then modify the complex spectrum in the second stage to suppress the residual noise and recover the speech phase. The whole process is performed in the short-time Fourier transform (STFT) spectrum domain. In this paper, we re-implement the second sub-process in the short-time discrete cosine transform (STDCT) spectrum domain, because we have found that STDCT exhibits greater noise suppression capability than STFT. Additionally, the implicit phase of STDCT allows simpler and more efficient phase recovery, which is challenging and computationally expensive in STFT-based methods. We therefore propose a novel two-stage framework, the STFT-STDCT spectrum fusion network (FDFNet), for speech enhancement in the cross-spectrum domain. Experimental results demonstrate that the proposed FDFNet outperforms previous two-stage methods and also exhibits superior performance compared to other advanced systems.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper attempts to solve two main challenges in existing speech enhancement methods:

1. **Difficulty of explicit phase estimation**: In the short-time Fourier transform (STFT) spectral domain, accurately recovering speech phase information remains challenging. Traditional methods usually focus only on recovering the target magnitude spectrum and ignore the phase information, which limits the effectiveness of speech enhancement.
2. **Residual noise suppression**: In the STFT complex spectral domain, it is very difficult to further suppress the residual noise after magnitude-spectrum enhancement. Existing two-stage methods roughly remove noise by enhancing the magnitude spectrum in the first stage, but their second stage, which processes the complex spectrum, is not effective at suppressing residual noise and recovering the speech phase.

To solve these problems, the authors propose a new two-stage framework, the **STFT-STDCT Spectrum Fusion Network (FDFNet)**. The main improvements of this method are:

- **Introduction of the short-time discrete cosine transform (STDCT)**: Compared with STFT, STDCT shows a stronger noise suppression ability, and its implicit phase information makes phase recovery simpler and more efficient.
- **Cross-spectral-domain processing**: Traditional STFT spectral-domain processing is combined with STDCT spectral-domain processing to achieve more effective speech enhancement. Specifically, the first stage uses STFT to enhance the magnitude spectrum, and the second stage further refines the result in the STDCT spectral domain to better suppress residual noise and recover the speech phase.

In the experiments, FDFNet outperforms existing two-stage methods and other advanced systems on multiple evaluation metrics, especially wideband perceptual speech quality (WB-PESQ), signal distortion (CSIG), background noise intrusiveness (CBAK), and overall audio quality (COVL).
In summary, this paper aims to propose a more effective real-time speech enhancement method by combining the advantages of STFT and STDCT to overcome the deficiencies of existing methods in phase recovery and residual noise suppression.
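
The "implicit phase" property of STDCT can be illustrated on a single frame. The following is a minimal sketch, not the authors' implementation: the function names (`dct2_ortho`, `idct2_ortho`, `dft`, `idft`) are hypothetical, and one unwindowed 16-sample frame stands in for the overlapping, windowed frames a real STFT or STDCT would use. It shows that the DCT coefficients of a real frame are themselves real, so the frame can be rebuilt from them directly, whereas the magnitude of the DFT alone discards the phase and does not reproduce the frame.

```python
import cmath
import math


def dct2_ortho(frame):
    """Orthonormal DCT-II of one real frame; every coefficient is a real number."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(frame))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs


def idct2_ortho(coeffs):
    """Inverse of dct2_ortho (the transpose of the orthonormal DCT-II matrix)."""
    n = len(coeffs)
    frame = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        frame.append(acc)
    return frame


def dft(frame):
    """Plain DFT of one frame; coefficients are complex (magnitude and phase)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, keeping only the real part of each output sample."""
    n = len(spectrum)
    return [sum(X * cmath.exp(2j * math.pi * k * i / n)
                for k, X in enumerate(spectrum)).real / n
            for i in range(n)]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy frame: a sinusoid with a non-zero phase offset (assumed test signal).
frame = [math.sin(2 * math.pi * 3 * i / 16 + 0.7) for i in range(16)]

# DCT path: real coefficients in, exact frame out; no separate phase is needed.
dct_coeffs = dct2_ortho(frame)
err_dct = max_abs_error(frame, idct2_ortho(dct_coeffs))

# DFT path with magnitude only: the phase is thrown away before inversion.
magnitudes = [abs(X) for X in dft(frame)]
err_mag_only = max_abs_error(frame, idft([complex(m, 0.0) for m in magnitudes]))

print("DCT reconstruction error:", err_dct)
print("DFT magnitude-only reconstruction error:", err_mag_only)
```

In this toy setting, a second stage that outputs real-valued DCT coefficients would already determine the waveform of the frame, while a magnitude-only STFT stage has to be followed by a separate estimate of the phase.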