A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

Zilu Guo, Qing Wang, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui
2024-05-27
Abstract: In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation diffusion model (VEIDM). Two notable distinctions between VPIDM and VEIDM are the scaling function of the mean of state variables and the constraint imposed on the variance relative to the mean's scale. We conduct a systematic exploration of the theoretical mechanism underlying VPIDM and develop insights regarding VPIDM's applications in SE and ASR using VPIDM as a frontend. Our proposed approach, evaluated on two distinct data sets, demonstrates VPIDM's superior performances over conventional discriminative SE algorithms. Furthermore, we assess the performance of the proposed model under varying signal-to-noise ratio (SNR) levels. The investigation reveals VPIDM's improved robustness in target noise elimination when compared to VEIDM. Furthermore, utilizing the mid-outputs of both VPIDM and VEIDM results in enhanced ASR accuracies, thereby highlighting the practical efficacy of our proposed approach.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the limitations of existing diffusion models (DMs) in single-channel speech enhancement (SE) and automatic speech recognition (ASR). Specifically, the authors point out the following challenges:

1. **Efficiency of the existing variance-exploding interpolation diffusion model (VEIDM)**: VEIDM may fail to enhance noisy speech effectively under low signal-to-noise ratio (SNR) conditions, and it relies on a corrector to reach good performance, which increases the computational cost.
2. **Initial error**: Because the clean speech is unavailable when forming the initial state \( S(T) \) of the reverse process, noisy speech is substituted as an approximation. This introduces an initial error that degrades the reverse process.
3. **Difficulty of applying DMs directly to SE**: The original DMs were designed to model flexible distributions, whereas in SE each noisy speech segment corresponds to exactly one clean speech segment. Because of this mismatch, applying DMs directly to SE performs poorly.

To solve these problems, the authors propose a new variance-preserving interpolation diffusion model (VPIDM), which has the following characteristics:

- **Only 25 iterative steps are required**: Compared with VEIDM, VPIDM significantly reduces the number of iterative steps.
- **No corrector is required**: With its improved interpolation scheme, VPIDM does not need an additional corrector, which reduces the computational complexity.
- **Revised mean and variance handling**: VPIDM redefines the scaling function applied to the mean of the state variables and constrains the variance relative to the scale of the mean. These are the two changes that distinguish it from VEIDM.
With these improvements, VPIDM shows superior performance in single-channel speech enhancement and automatic speech recognition, and in particular is more robust than VEIDM at removing noise across different SNR conditions. Experimental results on two distinct datasets show that VPIDM outperforms conventional discriminative SE algorithms, and that using the intermediate outputs of both VPIDM and VEIDM further improves ASR accuracy.
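The two distinctions named in the abstract (how the mean of the state is scaled, and how the variance is tied to that scale) can be illustrated with a minimal sketch. The schedules below are assumptions chosen for illustration only, not the paper's actual definitions: the linear interpolation weight `t`, the linear `sigma_max * t` growth, the exponential `alpha(t)`, and the parameter names `sigma_max` and `beta` are all hypothetical.

```python
import math
import random


def ve_schedule(t, sigma_max=0.5):
    """Hypothetical variance-exploding schedule.

    The mean is left unscaled (scale = 1) and the noise standard
    deviation grows with t without being tied to the mean's scale.
    """
    return 1.0, sigma_max * t


def vp_schedule(t, beta=4.0):
    """Hypothetical variance-preserving schedule.

    The mean is scaled by alpha(t), and the noise standard deviation is
    constrained by alpha(t)^2 + sigma(t)^2 = 1.
    """
    alpha = math.exp(-0.5 * beta * t)
    return alpha, math.sqrt(1.0 - alpha * alpha)


def sample_state(clean, noisy, t, schedule, rng):
    """Draw one state at time t in [0, 1].

    The mean interpolates between clean speech (t = 0) and noisy speech
    (t = 1), is multiplied by the schedule's scale, and Gaussian noise
    with the schedule's standard deviation is added.
    """
    scale, std = schedule(t)
    return [
        scale * ((1.0 - t) * x + t * y) + std * rng.gauss(0.0, 1.0)
        for x, y in zip(clean, noisy)
    ]


rng = random.Random(0)
clean = [0.2, -0.1, 0.4]   # toy "clean speech" samples (made up)
noisy = [0.5, 0.3, -0.2]   # toy "noisy speech" samples (made up)

for t in (0.0, 0.5, 1.0):
    ve_scale, ve_std = ve_schedule(t)
    vp_scale, vp_std = vp_schedule(t)
    print(
        f"t={t:.1f}  VE: scale={ve_scale:.3f} std={ve_std:.3f}  "
        f"VP: scale={vp_scale:.3f} std={vp_std:.3f}"
    )

state = sample_state(clean, noisy, 1.0, vp_schedule, rng)
```

Under these assumed schedules, the variance-preserving state shrinks the contribution of the interpolated mean as the noise grows, whereas the variance-exploding state keeps the mean at full scale and simply adds more noise on top of it.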