Abstract: Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find a critical mispronunciation issue is being overlooked. Our preliminary study reveals the unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in terms of pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model demonstrates effective scalability as both data and model sizes increase.

What problem does this paper attempt to address?

This paper addresses the unstable pronunciation of diffusion models in zero-shot speech synthesis. The authors find that, although diffusion models perform well in single-speaker scenarios, they make serious pronunciation errors in zero-shot scenarios. They attribute the problem to three factors:

1. **Complexity of the target data distribution**: The data distribution in the zero-shot scenario is the most complex, which significantly lowers pronunciation accuracy.
2. **Randomness of the diffusion process**: As the number of reverse steps increases, especially with an SDE solver, the randomness of diffusion accumulates and further degrades pronunciation quality.
3. **Damage to phoneme signals**: Phoneme signals with weak amplitude or contrast (such as formants) are especially vulnerable to damage.

To address these problems, the authors propose StableForm-TTS, a new zero-shot speech synthesis framework. By introducing source-filter theory, the framework aims to generate stable formant representations and thereby stabilize pronunciation. The specific improvements are:

- **Two inductive biases**: (1) variance features that alleviate the over-smoothing of speech signals; (2) a new architecture based on source-filter theory.
- **A dual-path structure**: The speech signal is decomposed into two paths, excitation and formant, which process intonation information and non-intonation information respectively, to keep pronunciation stable and natural.

Experimental results on unseen speakers show that StableForm-TTS outperforms the state-of-the-art method in pronunciation accuracy and naturalness while keeping speaker similarity comparable. The model also scales effectively as both the data size and the model size increase.
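To make the excitation/formant split concrete, here is a minimal sketch of the classical source-filter model from signal processing. It is not the StableForm-TTS network: the function names, the impulse-train excitation, the two-pole resonator, and all numeric values (16 kHz sample rate, 100 Hz pitch, 700 Hz formant, 100 Hz bandwidth) are assumptions chosen only for illustration.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Excitation path: one unit impulse per pitch period (pitch carries intonation)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, formant_hz, bandwidth_hz, sample_rate):
    """Filter path: two-pole resonator with a spectral peak near formant_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2.0 * math.pi * formant_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2   # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y1, y2 = y, y1
    return out


def magnitude_at(signal, freq_hz, sample_rate):
    """Magnitude of the signal's discrete-time Fourier transform at one frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
    return math.hypot(re, im)


if __name__ == "__main__":
    excitation = impulse_train(100.0, 16000, 3200)        # hypothetical 100 Hz pitch
    voiced = resonator(excitation, 700.0, 100.0, 16000)   # hypothetical 700 Hz formant
    for f in (200.0, 700.0, 1500.0):
        print(f, magnitude_at(voiced, f, 16000))
```

In this toy model the pitch is set only by the excitation and the spectral peak only by the filter, so changing `f0_hz` moves the harmonics without moving the resonance. That separation is the property the dual-path design is described as exploiting.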
### Formula summary

The formulas in this paper mainly describe the forward and reverse diffusion processes of the diffusion model:

- **Forward diffusion process**:
  \[ dX_t = \frac{1}{2}\beta_t(\mu - X_t)\,dt + \sqrt{\beta_t}\,dW_t \]
  where \(t\in[0, T]\) is the continuous time step, \(\beta_t\) is the non-negative noise schedule, \(\mu\) is the data-driven prior, and \(W_t\) is a Brownian motion.
- **Reverse diffusion process**:
  \[ dX_t = \left(\frac{1}{2}(\mu - X_t) - \nabla\log p_t(X_t)\right)\beta_t\,dt + \sqrt{\beta_t}\,d\widetilde{W}_t \]
  where \(\widetilde{W}_t\) is a Brownian motion in reverse time, and \(\nabla\log p_t(X_t)\) is the gradient of the log-density of the noisy data at time \(t\), called the score.

The forward process gradually turns data into noise centered on the prior \(\mu\), and the reverse process undoes it. In score-based diffusion models, a network is trained to approximate the score so that the reverse process can be integrated numerically from a noise sample back to the original data.
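As a rough illustration of the forward process, the sketch below works in the scalar case, where the forward SDE is an Ornstein-Uhlenbeck process with a closed-form Gaussian marginal. It compares that closed form against an Euler-Maruyama simulation. The linear schedule, its endpoint values `beta_0 = 0.05` and `beta_1 = 20.0`, the horizon \(T = 1\), and all function names are assumptions made for this example, not settings taken from the paper.

```python
import math
import random


def beta(t, beta_0=0.05, beta_1=20.0):
    """Linear noise schedule on t in [0, 1] (endpoint values are assumptions)."""
    return beta_0 + (beta_1 - beta_0) * t


def beta_integral(t, beta_0=0.05, beta_1=20.0):
    """Closed-form integral of beta(s) over s in [0, t]."""
    return beta_0 * t + 0.5 * (beta_1 - beta_0) * t * t


def forward_marginal(x0, mu, t):
    """Mean and variance of X_t given X_0 = x0 under the scalar forward SDE."""
    mean = mu + (x0 - mu) * math.exp(-0.5 * beta_integral(t))
    var = 1.0 - math.exp(-beta_integral(t))
    return mean, var


def simulate_forward(x0, mu, n_steps=200, rng=None):
    """Euler-Maruyama for dX = 0.5*beta*(mu - X) dt + sqrt(beta) dW on [0, 1]."""
    if rng is None:
        rng = random.Random(0)
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        b = beta(i * dt)
        # Drift pulls x toward mu; the diffusion term injects Gaussian noise.
        x += 0.5 * b * (mu - x) * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [simulate_forward(3.0, 1.0, rng=rng) for _ in range(500)]
    sample_mean = sum(samples) / len(samples)
    print("closed form:", forward_marginal(3.0, 1.0, 1.0))
    print("simulated mean:", sample_mean)
```

Under these assumed settings the marginal at \(t = 1\) is close to a unit-variance Gaussian centered on \(\mu\), which is why the reverse process can start from noise drawn around the prior rather than from the data.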

Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

FlashSpeech: Efficient Zero-Shot Speech Synthesis

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

DiffVoice: Text-to-Speech with Latent Diffusion

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model