DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

2024-10-15

Abstract: Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are inefficient and hinder the application of end-to-end optimization with perceptual metrics. In this paper, we propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization, achieving state-of-the-art performance. By incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, our approach optimizes perceptual evaluation metrics, leading to notable improvements in word error rate and speaker similarity. Our experiments show that DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity while being significantly faster. Moreover, our synthetic speech has a higher level of voice similarity to the prompt than the ground truth in both human evaluation and objective speaker similarity metric. This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences. The audio samples are available at https://dmdspeech.github.io/.

Audio and Speech Processing, Artificial Intelligence, Sound

What problem does this paper attempt to address?

The main problem this paper addresses is improving the quality and efficiency of zero-shot speech synthesis based on diffusion models. Specifically, it proposes a method named DMDSpeech that improves on existing speech synthesis techniques in three respects:

1. **Reduce inference time**: Diffusion models generate speech by iterative sampling, which leads to high computational cost and long generation times. DMDSpeech reduces inference time by distilling the teacher diffusion model into a student model that generates speech in four steps.
2. **Optimize perceptual metrics**: To bring the generated speech closer to human auditory preferences, DMDSpeech introduces direct metric optimization. The model uses Connectionist Temporal Classification (CTC) loss to optimize text alignment and Speaker Verification (SV) loss to optimize speaker similarity, which improves word error rate and speaker similarity.
3. **Improve generation quality**: With these techniques, DMDSpeech outperforms previous state-of-the-art models in naturalness and speaker similarity. Its synthesized speech is also more similar to the prompt speech than the ground-truth recordings are, in both human evaluation and objective speaker similarity metrics.

In summary, the main goal of this paper is efficient, high-quality zero-shot speech synthesis, achieved by improving the training and generation processes of diffusion models, with gains in naturalness, speaker similarity, and inference speed.
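The two metric losses named above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. It assumes the SV loss is one minus the cosine similarity between speaker embeddings of the prompt and the synthesized speech, and it computes the CTC loss with the standard forward algorithm over per-frame label probabilities. The function names and the weights `w_ctc` and `w_sv` are hypothetical, and the embeddings and probabilities would in practice come from pretrained speaker verification and ASR models that are not shown here.

```python
import math


def cosine_sv_loss(emb_prompt, emb_synth):
    """Assumed SV loss: 1 - cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_prompt, emb_synth))
    norm_p = math.sqrt(sum(a * a for a in emb_prompt))
    norm_s = math.sqrt(sum(b * b for b in emb_synth))
    return 1.0 - dot / (norm_p * norm_s)


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    probs:  list of per-frame probability distributions over labels.
    target: list of label indices, without blanks.
    Works in linear probability space for readability; a real
    implementation would use log space to avoid underflow.
    """
    # Extended label sequence: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] = probability of all partial alignments ending at ext[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank between two labels is allowed
            # only when those labels differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


def metric_loss(probs, target, emb_prompt, emb_synth, w_ctc=1.0, w_sv=1.0):
    """Hypothetical weighted sum of the two metric losses."""
    return (w_ctc * ctc_loss(probs, target)
            + w_sv * cosine_sv_loss(emb_prompt, emb_synth))
```

For example, with two frames of uniform probabilities over a blank and one label, three of the four alignments decode to the single-label target, so `ctc_loss([[0.5, 0.5], [0.5, 0.5]], [1])` equals `-log(0.75)`. In a trainable system both terms would be computed with a differentiable framework so that gradients flow back into the few-step student generator.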

Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR

DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis

Deep Metric Learning For The Target Cost In Unit-Selection Speech Synthesizer

Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model