Abstract: The goal of speech enhancement (SE) is to remove background interference from a noisy speech signal. Generative models such as diffusion models (DMs) have been applied to SE because they generalize better to unseen noisy scenes. Technical routes for DM-based SE methods fall into three types: task-adapted diffusion process formulations, generator-plus-conditioner (GPC) structures, and multi-stage frameworks. We focus on the first two approaches, which are built on the GPC architecture and use a task-adapted diffusion process to better handle real noise. However, the performance of these SE models is limited by three issues: (a) non-Gaussian noise estimation in the task-adapted diffusion process; (b) conditional domain bias caused by the weak conditioner design in the GPC structure; and (c) a large amount of residual noise caused by unreasonable interpolation operations during inference. To address these problems, we propose a noise-aware diffusion-based SE model (NADiffuSE), in which a noise representation is extracted from the noisy speech signal and introduced as global conditional information for estimating the non-Gaussian components. Furthermore, an anchor-based inference algorithm is employed to achieve a compromise between speech distortion and residual noise. To mitigate the performance degradation caused by conditional domain bias in the GPC framework, we investigate three model variants, all of which can be viewed as multi-stage SE based on preprocessing networks for Mel spectrograms. Experimental results show that NADiffuSE outperforms other DM-based SE models under the GPC infrastructure. Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/

FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

Revisiting Denoising Diffusion Probabilistic Models for Speech Enhancement: Condition Collapse, Efficiency and Refinement

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios

MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Single and Few-step Diffusion for Generative Speech Enhancement

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

DiffVoice: Text-to-Speech with Latent Diffusion

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Enhancing Unsupervised Speech Recognition with Diffusion GANs