Abstract: In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation diffusion model (VEIDM). Two notable distinctions between VPIDM and VEIDM are the scaling function of the mean of state variables and the constraint imposed on the variance relative to the mean's scale. We conduct a systematic exploration of the theoretical mechanism underlying VPIDM and develop insights regarding VPIDM's applications in SE and in ASR with VPIDM as a frontend. Our proposed approach, evaluated on two distinct data sets, demonstrates VPIDM's superior performance over conventional discriminative SE algorithms. Furthermore, we assess the performance of the proposed model under varying signal-to-noise ratio (SNR) levels. The investigation reveals VPIDM's improved robustness in target noise elimination when compared to VEIDM. In addition, utilizing the mid-outputs of both VPIDM and VEIDM results in improved ASR accuracy, thereby highlighting the practical efficacy of our proposed approach.

What problem does this paper attempt to address?

This paper addresses the limitations of existing diffusion models (DMs) in single-channel speech enhancement (SE) and automatic speech recognition (ASR). Specifically, the authors point out the following challenges:

1. **Efficiency problems of the existing variance-exploding interpolation diffusion model (VEIDM)**: VEIDM may not enhance noisy speech efficiently under low signal-to-noise ratio (SNR) conditions, and it requires a corrector to improve performance, which increases the computational cost.
2. **Initial error problem**: In the reverse process, the clean speech is unavailable at the initial state \( S(T) \), so it is approximated by the noisy speech. The resulting initial error adversely affects the reverse process.
3. **Difficulties in directly applying DMs to SE tasks**: DMs were originally designed to model flexible distributions, whereas in SE each noisy speech segment corresponds to exactly one clean speech segment. Because of this mismatch, applying DMs directly to SE tasks performs poorly.

To solve these problems, the authors propose a new variance-preserving interpolation diffusion model (VPIDM). This model has the following characteristics:

- **Only 25 iteration steps are required**: Compared with VEIDM, VPIDM significantly reduces the required iteration steps.
- **No corrector is required**: Through an improved interpolation method, VPIDM does not need an additional corrector, thereby reducing the computational complexity.
- **Improved mean and variance processing**: VPIDM redefines the scaling function of the state-variable mean and constrains the variance relative to the mean's scale; the paper analyzes the theoretical mechanism behind these choices.

Through these improvements, VPIDM performs well in single-channel speech enhancement and as a frontend for automatic speech recognition, and it is more robust than VEIDM at removing the target noise across SNR levels. Experiments on two distinct datasets show that VPIDM outperforms conventional discriminative SE algorithms, and using the intermediate outputs of both VPIDM and VEIDM further improves ASR accuracy.
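
To make the two distinctions concrete (the scaling of the state mean, and the variance constraint tied to that scale), here is a minimal Python sketch. It is not the paper's formulation: the linear interpolation weight `interp_weight`, the exponential scale `vp_alpha`, the constants `beta` and `sigma_max`, and all function names are illustrative assumptions chosen only to show the general shape of a variance-exploding versus a variance-preserving interpolation.

```python
import math
import random


def interp_weight(t):
    """Hypothetical weight on clean speech: 1 at t=0, 0 at t=1 (assumed linear)."""
    return 1.0 - t


def vp_alpha(t, beta=4.0):
    """Hypothetical mean scale for the variance-preserving case (assumed exponential)."""
    return math.exp(-0.5 * beta * t)


def ve_params(x0, y, t, sigma_max=1.0):
    """Variance-exploding style: unscaled interpolated mean, std grows freely with t."""
    lam = interp_weight(t)
    mean = [lam * a + (1.0 - lam) * b for a, b in zip(x0, y)]
    return mean, sigma_max * t


def vp_params(x0, y, t, beta=4.0):
    """Variance-preserving style: mean scaled by alpha(t), variance fixed to 1 - alpha(t)^2."""
    lam = interp_weight(t)
    alpha = vp_alpha(t, beta)
    mean = [alpha * (lam * a + (1.0 - lam) * b) for a, b in zip(x0, y)]
    return mean, math.sqrt(1.0 - alpha * alpha)


def sample_state(mean, std, rng):
    """Draw one state: mean plus Gaussian noise with the given standard deviation."""
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


# Toy two-sample "signals": clean speech x0 and noisy speech y (made-up values).
clean = [0.5, -0.25]
noisy = [0.7, 0.1]
rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    ve_mean, ve_std = ve_params(clean, noisy, t)
    vp_mean, vp_std = vp_params(clean, noisy, t)
    print(t, "VE std:", ve_std, "VP std:", round(vp_std, 4),
          "VP sample:", sample_state(vp_mean, vp_std, rng))
```

In this toy setup, both variants start at the clean signal with zero noise at `t = 0`. They differ afterwards: the variance-exploding state keeps an unscaled mean while its noise level is set independently, whereas the variance-preserving state shrinks its mean by `alpha(t)` and ties the noise variance to `1 - alpha(t)^2`.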

A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Adaptive Beamforming Based on Interference-Plus-Noise Covariance Matrix Reconstruction for Speech Separation

Pre-training Feature Guided Diffusion Model for Speech Enhancement

NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

An Analysis of the Variance of Diffusion-based Speech Enhancement

Investigating the Design Space of Diffusion Models for Speech Enhancement

A Study on Speech Enhancement Based on Diffusion Probabilistic Model

Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models

An Improved VTS Feature Compensation Using Mixture Models of Distortion and IVN Training for Noisy Speech Recognition

Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

A Modified Speech Enhancement Algorithm Using a Universal Speaker Model

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models