John Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu
Abstract: A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent works that have used Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.
What problem does this paper attempt to address?
The paper addresses the following problem: in zero-shot voice cloning, how can the naturalness and expressiveness of a text-to-speech (TTS) system be improved, particularly when training on multi-speaker datasets, where the model tends to generate over-smoothed speech features that lack diversity?
Specifically, the paper points out the following problems:
1. **Over-smoothing problem**: Traditional TTS models tend to predict speech features close to the dataset average during training, so the generated speech lacks naturalness and diversity.
2. **Quality gap in zero-shot voice cloning**: Compared with supervised or few-shot systems, zero-shot voice cloning systems show a clear gap in speech quality and speaker similarity.
3. **Multi-modal mapping problem**: The mapping from text to speech is one-to-many; that is, the same text can correspond to many different spoken realizations. Existing TTS models struggle to capture this diversity.
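The link between the one-to-many mapping and over-smoothing can be illustrated with a small worked example. The sketch below is not taken from the paper: the two pitch contours are hypothetical values chosen for illustration. It shows that when the same text has two equally valid realizations, the prediction that minimizes the expected mean squared error is their frame-wise average, which resembles neither of them.

```python
# Two equally plausible pitch contours (in Hz) for the same text.
# These values are hypothetical and chosen only for illustration.
rising = [100.0, 120.0, 140.0, 160.0]
falling = [160.0, 140.0, 120.0, 100.0]
targets = [rising, falling]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred, candidates):
    """Average MSE of one prediction against all valid realizations."""
    return sum(mse(pred, c) for c in candidates) / len(candidates)


# The frame-wise mean of the realizations minimizes the expected MSE.
mean_contour = [sum(frame) / len(frame) for frame in zip(*targets)]

print(mean_contour)                          # [130.0, 130.0, 130.0, 130.0]
print(expected_mse(mean_contour, targets))   # 500.0
print(expected_mse(rising, targets))         # 1000.0
```

The flat contour scores better under the reconstruction objective than either natural contour, even though no speaker produced it. An adversarial term is one way to penalize such averaged predictions.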
To solve these problems, the paper proposes a multi-modal adversarial training method based on generative adversarial networks (GANs), with two main components:
- **Multi-modal Fusion Discriminator**: A discriminator with a Transformer encoder-decoder structure that conditions on text information and speaker embeddings, so that generated speech features are evaluated in the context of what is said and who is speaking.
- **Multi-feature Generative Adversarial Training**: Adversarial training is applied not only to acoustic features (such as the mel-spectrogram) but also to prosodic features (such as pitch, energy, and duration), improving the naturalness and expressiveness of the generated speech.
Through these contributions, the paper aims to narrow the quality gap between zero-shot voice cloning systems and supervised systems and to generate more natural and expressive speech.
### Formula Summary
- **Generative Loss**:
\[
L_{Ga}=\text{MAE}(\hat{y}_a, y_a)
\]
where \(\hat{y}_a\) is the predicted acoustic feature, \(y_a\) is the real acoustic feature, and MAE represents the mean absolute error.
- **Adversarial Loss**:
\[
L_{Aa}=-D_a(\hat{y}_a|x_t, x_s)
\]
where \(D_a\) is the discriminator of acoustic features, \(x_t\) is the input text information, and \(x_s\) is the speaker representation.
- **Discriminator Loss**:
\[
L_{Da}=-\min(0, D_a(y_a|x_t, x_s)-1)-\min(0, -D_a(\hat{y}_a|x_t, x_s)-1)
\]
- **Prosodic Feature Generative Loss**:
\[
L_{Gp}=\text{MSE}(\hat{y}_p, y_p)
\]
where \(\hat{y}_p\) is the predicted prosodic feature, \(y_p\) is the real prosodic feature, and MSE represents the mean square error.
- **Prosodic Feature Adversarial Loss**:
\[
L_{Ap}=-D_p(\hat{y}_p|x_t, x_s)
\]
- **Prosodic Feature Discriminator Loss**:
\[
L_{Dp}=-\min(0, D_p(y_p|x_t, x_s)-1)-\min(0, -D_p(\hat{y}_p|x_t, x_s)-1)
\]
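The acoustic and prosodic losses above share the same form, so they can be sketched with a few small functions. This is a minimal illustration under stated assumptions, not the authors' implementation: discriminator outputs are represented as plain lists of scalar scores, averaging over the batch is an assumption, and the function names are hypothetical.

```python
def mae(pred, target):
    """Reconstruction loss in the form of L_Ga: mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def adversarial_loss(d_fake):
    """Generator-side term in the form of L_Aa / L_Ap: -D(y_hat | x_t, x_s).

    d_fake holds discriminator scores for generated features.
    Averaging over the batch is an assumption of this sketch.
    """
    return -sum(d_fake) / len(d_fake)


def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss in the form of L_Da / L_Dp.

    Real scores are pushed above +1 and generated scores below -1;
    scores already past the margin contribute zero.
    """
    real_term = sum(-min(0.0, s - 1.0) for s in d_real) / len(d_real)
    fake_term = sum(-min(0.0, -s - 1.0) for s in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator scores for a batch of two examples.
d_real = [1.5, 0.5]    # first is past the +1 margin, second is not
d_fake = [-2.0, 0.0]   # first is past the -1 margin, second is not

print(discriminator_hinge_loss(d_real, d_fake))  # 0.75
print(adversarial_loss(d_fake))                  # 1.0
print(mae([1.0, 3.0], [2.0, 1.0]))               # 1.5
```

Only the scores inside the margin (0.5 for real, 0.0 for generated) contribute to the discriminator loss, which is the behavior expected from the two `min(0, ·)` terms in the formulas.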
- **Total Optimization Loss**:
\[
L_{GA}=L_{Ga}