Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

Changjin Han, Seokgi Lee, Gyuhyeon Nam, Gyeongsu Chae
2024-09-14
Abstract: Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find a critical mispronunciation issue is being overlooked. Our preliminary study reveals the unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in terms of pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model demonstrates effective scalability as both data and model sizes increase.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses the pronunciation instability of diffusion models in zero-shot speech synthesis. The authors find that although diffusion models perform well in single-speaker scenarios, they produce serious pronunciation errors in zero-shot scenarios. They attribute the problem to three factors:

1. **Complexity of the target data distribution**: The data distribution in the zero-shot scenario is the most complex, which significantly reduces pronunciation accuracy.
2. **Randomness of the diffusion process**: As the number of reverse steps increases, especially with an SDE solver, the randomness of diffusion accumulates and further degrades pronunciation quality.
3. **Damage to phoneme signals**: Phoneme signals with weak amplitude or contrast, such as formants, are especially vulnerable to damage.

To address these problems, the authors propose StableForm-TTS, a new zero-shot speech synthesis framework. It applies source-filter theory to generate stable formant representations and thereby stabilize pronunciation. Specific improvements include:

- **Introducing two inductive biases**: (1) variance features that alleviate over-smoothing of the speech signal; (2) a new architecture based on source-filter theory.
- **Adopting a dual-path structure**: The speech signal is decomposed into an excitation path and a formant path, which process intonation and non-intonation information respectively, to keep pronunciation stable and natural.

Experimental results on unseen speakers show that StableForm-TTS outperforms existing methods in pronunciation accuracy and naturalness while maintaining comparable speaker similarity. The model also scales effectively as data and model sizes increase.
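The excitation/formant split can be illustrated with the classical source-filter model from signal processing. The sketch below is a textbook toy, not the StableForm-TTS architecture: the sample rate, the single formant at 800 Hz, the 80 Hz bandwidth, and all function names are assumptions chosen for illustration. An impulse train plays the role of the excitation (it sets the pitch), and a two-pole resonator plays the role of the filter (it sets the formant).

```python
import cmath
import math

FS = 16000  # sample rate in Hz (assumed for this toy example)


def impulse_train(f0, n_samples):
    """Source: one unit impulse per pitch period of FS / f0 samples."""
    period = round(FS / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, freq, bandwidth):
    """Filter: two-pole resonator with a spectral peak near `freq` Hz."""
    r = math.exp(-math.pi * bandwidth / FS)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq / FS         # pole angle from centre frequency
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = sample + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def magnitude_at(signal, freq):
    """Magnitude of the discrete Fourier transform of `signal` at `freq` Hz."""
    w = -2j * math.pi * freq / FS
    return abs(sum(s * cmath.exp(w * n) for n, s in enumerate(signal)))


def strongest_harmonic(f0, formant=800.0, bandwidth=80.0, n_samples=8000):
    """Return the harmonic of f0 (up to 2 kHz) with the largest magnitude."""
    speech = resonator(impulse_train(f0, n_samples), formant, bandwidth)
    tail = speech[n_samples // 2:]  # drop the start-up transient
    harmonics = [f0 * k for k in range(1, int(2000 / f0) + 1)]
    return max(harmonics, key=lambda f: magnitude_at(tail, f))


for f0 in (100, 200):
    print(f"f0 = {f0} Hz, strongest harmonic = {strongest_harmonic(f0)} Hz")
```

Changing `f0` changes the harmonic spacing, which is the pitch carried by the excitation, while the strongest harmonic should stay at the 800 Hz formant set by the filter. That independence is the property a dual-path design can exploit.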
### Formula summary

The formulas in this paper mainly describe the forward and reverse diffusion processes of the diffusion model:

- **Forward diffusion process**:
  \[ dX_t=\frac{1}{2}\beta_t(\mu - X_t)\,dt+\sqrt{\beta_t}\,dW_t \]
  where \(t\in[0, T]\) is the continuous time step, \(\beta_t\) is the non-negative noise schedule, \(\mu\) is the data-driven prior, and \(W_t\) is the Brownian motion.
- **Reverse diffusion process**:
  \[ dX_t=\left(\frac{1}{2}(\mu - X_t)-\nabla\log p(X_t)\right)\beta_t\,dt+\sqrt{\beta_t}\,d\widetilde{W}_t \]
  where \(\widetilde{W}_t\) is the Brownian motion in reverse time, and \(\nabla\log p(X_t)\) is the gradient of the log-density of the noisy data, called the score.

The forward process gradually turns data into Gaussian noise, and the model learns the score so that the reverse process can reconstruct the original data from that noise.
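As a rough illustration of how these two processes behave, the following sketch integrates both SDEs with the Euler-Maruyama method on a one-dimensional toy problem. It assumes the standard reverse-time form in which the drift is multiplied by \(\beta_t\), a linear noise schedule, and Gaussian toy data, for which the score is known in closed form. In a real TTS model a trained network estimates the score. All constants and function names are hypothetical and are not taken from the paper.

```python
import math
import random

# Illustrative values only (assumed, not from the paper).
BETA_0, BETA_1 = 0.05, 20.0      # endpoints of a linear noise schedule
T = 1.0                          # time horizon
MU = 1.0                         # stand-in for the data-driven prior mu
DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1-D Gaussian "data" distribution


def beta(t):
    """Linear noise schedule beta_t."""
    return BETA_0 + (BETA_1 - BETA_0) * t


def marginal(t):
    """Mean and variance of X_t under the forward SDE for Gaussian data."""
    integral = BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t
    a = math.exp(-0.5 * integral)
    mean = MU + (DATA_MEAN - MU) * a
    var = (DATA_STD * a) ** 2 + (1.0 - a * a)
    return mean, var


def score(x, t):
    """Exact score of the Gaussian marginal (a network would estimate this)."""
    mean, var = marginal(t)
    return -(x - mean) / var


def forward_sample(x0, n_steps, rng):
    """Euler-Maruyama integration of the forward SDE from t = 0 to t = T."""
    h = T / n_steps
    x = x0
    for i in range(n_steps):
        b = beta(i * h)
        x += 0.5 * b * (MU - x) * h + math.sqrt(b * h) * rng.gauss(0.0, 1.0)
    return x


def reverse_sample(n_steps, rng):
    """Euler-Maruyama integration of the reverse SDE from t = T back to t = 0."""
    h = T / n_steps
    x = rng.gauss(MU, 1.0)  # start from noise centred on the prior mu
    for i in range(n_steps, 0, -1):
        t = i * h
        b = beta(t)
        drift = (0.5 * (MU - x) - score(x, t)) * b
        # Time runs backwards, so the drift is subtracted; fresh noise is
        # injected at every step.
        x = x - drift * h + math.sqrt(b * h) * rng.gauss(0.0, 1.0)
    return x


def mean_std(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


rng = random.Random(0)
samples = [reverse_sample(200, rng) for _ in range(1000)]
print("reverse-process sample mean/std:", mean_std(samples))
```

With the exact score, the reverse samples should land close to `DATA_MEAN` and `DATA_STD`. Each reverse step draws fresh Gaussian noise, which is the per-step randomness that an SDE solver introduces.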