Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, which avoids marked quality degradation when sampling is accelerated. To tackle the model convergence challenge that arises with fewer diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as the training target and distills the teacher's behavior into a new model with N/2 steps. This allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff enables a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/
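The two ideas named in the abstract, predicting clean data directly and distilling an N-step DDIM teacher into an N/2-step student, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard deterministic DDIM update and a progressive-distillation-style target, and the names `ddim_step_x0`, `teacher_two_steps`, `student_target_x0`, and the cumulative signal coefficients `ab_t`, `ab_mid`, `ab_s` are hypothetical. The toy "denoiser" stands in for a trained network.

```python
def ddim_step_x0(x_t, x0_hat, ab_t, ab_s):
    """One deterministic DDIM update from noise level ab_t to ab_s.

    x0_hat is the denoiser's direct prediction of the clean sample
    (the clean-data parameterization). ab_* are assumed cumulative
    signal coefficients in (0, 1]; larger means less noise.
    """
    # Noise implied by the clean-data prediction.
    eps_hat = (x_t - ab_t ** 0.5 * x0_hat) / (1.0 - ab_t) ** 0.5
    # Re-noise the prediction to the lower noise level ab_s.
    return ab_s ** 0.5 * x0_hat + (1.0 - ab_s) ** 0.5 * eps_hat


def teacher_two_steps(x_t, denoiser, ab_t, ab_mid, ab_s):
    """Run the teacher for two consecutive DDIM steps: t -> mid -> s."""
    x_mid = ddim_step_x0(x_t, denoiser(x_t, ab_t), ab_t, ab_mid)
    return ddim_step_x0(x_mid, denoiser(x_mid, ab_mid), ab_mid, ab_s)


def student_target_x0(x_t, x_s, ab_t, ab_s):
    """Clean-data target for which ONE student step t -> s lands on x_s.

    Obtained by solving the DDIM update above for x0_hat.
    Requires ab_t < 1 and ab_s > ab_t.
    """
    r = (1.0 - ab_s) ** 0.5 / (1.0 - ab_t) ** 0.5
    return (x_s - r * x_t) / (ab_s ** 0.5 - r * ab_t ** 0.5)


# Toy check with scalars: a "perfect" teacher that always returns the
# true clean value. The student target then recovers that clean value.
x0_true, eps = 0.7, -1.3
ab_t, ab_mid, ab_s = 0.1, 0.5, 0.9
x_t = ab_t ** 0.5 * x0_true + (1.0 - ab_t) ** 0.5 * eps
x_s = teacher_two_steps(x_t, lambda x, ab: x0_true, ab_t, ab_mid, ab_s)
target = student_target_x0(x_t, x_s, ab_t, ab_s)
print(abs(target - x0_true) < 1e-9)
```

Because the functions use only arithmetic operators, the same code should also work element-wise on array types such as mel-spectrogram tensors. Repeating the halving (N, N/2, N/4, and so on) is one plausible reading of "progressive" here; the abstract itself only states the N to N/2 step.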

Differentiable Duration Refinement Using Internal Division for Non-Autoregressive Text-to-Speech

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

Expressive, Variable, and Controllable Duration Modelling in TTS

Duration Modeling of Neural TTS for Automatic Dubbing

Duration optimization of speaker adaptation in Mandarin TTS

End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

Balanced SNR-Aware Distillation for Guided Text-to-Audio Generation

Teacher-Student Training For Robust Tacotron-Based TTS