Pre-training Feature Guided Diffusion Model for Speech Enhancement

Yiyuan Yang, Niki Trigoni, Andrew Markham
2024-06-12
Abstract: Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, improving communication and listening experiences. In this paper, we introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE) and leveraging pre-trained features for guidance during the reverse process, coupled with the utilization of the deterministic discrete integration method (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Demonstrating state-of-the-art results on two public datasets with different SNRs, our model outshines other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to improve the clarity and intelligibility of speech in noisy environments, thereby improving communication and listening experiences. Specifically, it introduces a novel pre-trained feature-guided diffusion model (FUSE) for efficient speech enhancement that addresses the limitations of existing discriminative and generative models.

### Specific manifestations of the problem

1. **Noise pollution**: In real-world applications, clean sound sources are inevitably affected by environmental noise, speaker interference, and codec degradation. Signal quality declines, which in turn degrades downstream recognition or monitoring tasks.
2. **Limitations of existing methods**:
   - **Discriminative models**: These models rely on supervised learning to map noisy speech to the corresponding clean-speech targets. Although effective in specific situations, they adapt poorly to new noise types, reverberation, or signal-to-noise ratios (SNRs), and they require a large amount of labeled training data, which increases deployment complexity.
   - **Generative models**: Generative models are more robust in unseen scenarios and can produce more natural speech, but they have not yet become the first choice in practice because of shortcomings in performance, efficiency, number of training steps, and inference time.

### Solutions proposed in the paper

The paper proposes a pre-trained feature-guided unified diffusion model (FUSE) that combines two complementary acoustic features. The main innovations of this model include:

1. **Introducing pre-trained features as conditions**: Acoustic features extracted by a pre-trained model serve as conditions that guide the generation process of the diffusion model, improving the quality of the generated speech.
2. **Using the deterministic discrete integration method (DDIM)**: DDIM reduces the number of sampling steps, which significantly improves efficiency while maintaining high speech quality.
3. **Fusing the variational autoencoder (VAE)**: Spectral features are integrated into the VAE, and its latent feature maps are used to train the conditional diffusion model, further improving efficiency.

### Experimental results

The experimental results show that FUSE achieves state-of-the-art results on two public datasets. It outperforms the other baseline methods on multiple evaluation metrics and also performs well in terms of efficiency and robustness. In addition, the model maintains stable performance under different SNR conditions, indicating its applicability in complex noise environments.

### Summary

The paper addresses the efficiency and performance challenges of existing speech enhancement methods by introducing a pre-trained feature-guided unified diffusion model, providing an efficient, high-quality speech enhancement solution for practical applications.
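
To make the step-reduction idea concrete, the following is a minimal sketch of deterministic DDIM-style sampling over a subsampled timestep grid. It is not the paper's implementation. The linear beta schedule, the function names (`make_alpha_bars`, `make_oracle_eps_model`, `ddim_sample`), and the 1000-step training grid are all assumptions made for illustration, and the update rule is the standard deterministic (eta = 0) DDIM step, which is assumed to be what the summary means by DDIM. The noise predictor is a hypothetical oracle that is handed the clean target as its condition. In the setting described above, that role would instead be played by a trained network guided by pre-trained features and operating on VAE latents.

```python
import numpy as np


def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def make_oracle_eps_model(alpha_bars):
    """Hypothetical stand-in for a trained conditional noise predictor.

    It treats `cond` as the clean target and returns the noise consistent with
    x_t = sqrt(a_t) * cond + sqrt(1 - a_t) * eps. A real model would be a
    neural network that only sees guidance features, not the clean signal.
    """
    def eps_model(x_t, t, cond):
        a_t = alpha_bars[t]
        return (x_t - np.sqrt(a_t) * cond) / np.sqrt(1.0 - a_t)
    return eps_model


def ddim_sample(eps_model, cond, alpha_bars, num_steps=20, seed=0):
    """Deterministic (eta = 0) DDIM sampling over a subsampled timestep grid."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)  # start from Gaussian noise
    last = len(alpha_bars) - 1
    # Visit only `num_steps` of the training timesteps, noisiest to cleanest.
    timesteps = np.linspace(last, 0, num_steps).round().astype(int)
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the final step the target is the clean sample (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(x, t, cond)
        # Estimate the clean sample, then re-noise it to the previous level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x


if __name__ == "__main__":
    alpha_bars = make_alpha_bars()
    target = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))  # toy "clean latent"
    enhanced = ddim_sample(make_oracle_eps_model(alpha_bars), target, alpha_bars)
    print("max abs error:", np.max(np.abs(enhanced - target)))
```

Because the update is deterministic, the number of network evaluations equals `num_steps` rather than the full training schedule length, which is where the efficiency gain comes from. With the oracle predictor the sampler recovers the target for any step count, so the sketch only checks the sampler arithmetic. It says nothing about how enhancement quality varies with step count for a trained model.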