DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin
2024-10-15
Abstract: Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are inefficient and hinder the application of end-to-end optimization with perceptual metrics. In this paper, we propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization, achieving state-of-the-art performance. By incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, our approach optimizes perceptual evaluation metrics, leading to notable improvements in word error rate and speaker similarity. Our experiments show that DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity while being significantly faster. Moreover, our synthetic speech has a higher level of voice similarity to the prompt than the ground truth in both human evaluation and objective speaker similarity metric. This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences. The audio samples are available at https://dmdspeech.github.io/.
Audio and Speech Processing, Artificial Intelligence, Sound
What problem does this paper attempt to address?
This paper aims to improve both the quality and the efficiency of zero-shot speech synthesis with diffusion models. It proposes a method named DMDSpeech, which improves on existing speech synthesis approaches in three ways:

1. **Reduce inference time**: Conventional diffusion models generate speech through iterative sampling, which leads to high computational cost and long generation times. DMDSpeech reduces inference time by distilling the teacher diffusion model into a student model that generates speech in four sampling steps.
2. **Optimize perceptual metrics**: Because the distilled student is fast enough to be trained end to end, DMDSpeech can optimize evaluation metrics directly. It uses a Connectionist Temporal Classification (CTC) loss to improve text alignment, which lowers word error rate, and a Speaker Verification (SV) loss to improve speaker similarity.
3. **Improve generation quality**: With these techniques, DMDSpeech outperforms previous state-of-the-art models in naturalness and speaker similarity. Its synthesized speech is also rated as more similar to the prompt speech than the ground-truth recordings are, in both human evaluation and objective speaker similarity metrics.

In summary, the paper's goal is efficient, high-quality zero-shot speech synthesis, achieved by improving how diffusion models are trained and sampled, with gains in naturalness, speaker similarity, and inference speed.
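The direct metric optimization described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the SV loss is one minus the cosine similarity between speaker embeddings, computes the CTC loss with the standard forward algorithm, and adds both to a distillation loss using hypothetical weights `lambda_ctc` and `lambda_sv`. The function names and the plain-Python list inputs are illustrative only.

```python
import math

NEG_INF = float("-inf")


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: T frames, each a list of per-token log-probabilities.
    target:    list of token ids, without blanks.
    """
    # Interleave blanks: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for token in target:
        ext += [token, blank]
    num_states = len(ext)

    # Initialise: a path may start in the first blank or the first token.
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]  # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = _logsumexp(candidates) + frame[ext[s]]
        alpha = new_alpha

    # A valid path ends in the last token or the trailing blank.
    final = [alpha[-1]] if num_states == 1 else [alpha[-1], alpha[-2]]
    return -_logsumexp(final)


def sv_loss(emb_generated, emb_prompt):
    """Assumed SV loss: 1 - cosine similarity (embeddings must be non-zero)."""
    dot = sum(a * b for a, b in zip(emb_generated, emb_prompt))
    norm_gen = math.sqrt(sum(a * a for a in emb_generated))
    norm_prompt = math.sqrt(sum(b * b for b in emb_prompt))
    return 1.0 - dot / (norm_gen * norm_prompt)


def combined_loss(distill_loss, log_probs, target, emb_generated, emb_prompt,
                  lambda_ctc=1.0, lambda_sv=1.0):
    """Hypothetical weighted sum of distillation, CTC, and SV terms."""
    return (distill_loss
            + lambda_ctc * ctc_loss(log_probs, target)
            + lambda_sv * sv_loss(emb_generated, emb_prompt))


# Toy usage: two frames, vocabulary {0: blank, 1: "a"}, uniform predictions.
frames = [[math.log(0.5), math.log(0.5)]] * 2
loss = combined_loss(1.0, frames, [1], [1.0, 0.0], [1.0, 0.0], lambda_ctc=0.5)
print(loss)
```

In the toy call, three of the four possible two-frame alignments decode to the target, so the CTC term is `-log(0.75)`, and the SV term is zero because the two embeddings are identical. In practice the frame log-probabilities would come from a speech recognition model and the embeddings from a speaker verification model, both applied to the student's output; this sketch leaves those models out.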