Abstract: A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent works that have used Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.

What problem does this paper attempt to address?

This paper asks how to improve the naturalness and expressiveness of a text-to-speech (TTS) system in the zero-shot voice cloning task. The difficulty is most visible on multi-speaker datasets, where the model tends to generate overly smooth speech features that lack diversity. Specifically, the paper points out the following problems:

1. **Over-smoothing problem**: Traditional TTS models tend to predict speech features close to the dataset average during training, so the generated speech lacks naturalness and diversity.
2. **Quality gap in zero-shot voice cloning**: Compared with supervised or few-shot systems, zero-shot voice cloning systems show an obvious gap in speech quality and speaker similarity.
3. **Multi-modal mapping problem**: The mapping from text to speech is one-to-many, that is, the same text can correspond to many different spoken renditions. Existing TTS models struggle to capture this diversity.

To address these problems, the paper proposes a multi-modal adversarial training method based on the generative adversarial network (GAN), which includes the following components:

- **Multi-modal Fusion Discriminator**: A discriminator with a Transformer encoder-decoder structure combines text information and speaker embeddings to evaluate the generated speech features more comprehensively.
- **Multi-feature Generative Adversarial Training**: Training optimizes not only acoustic features (such as the mel-spectrogram) but also prosodic features (such as pitch, energy, and duration), improving the naturalness and expressiveness of the generated speech.

Through these innovations, the paper aims to narrow the quality gap between zero-shot voice cloning systems and supervised systems and to generate more natural and expressive speech.
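The over-smoothing and one-to-many points above can be illustrated with a toy calculation. The sketch below is not from the paper: the pitch values and the candidate grid are hypothetical, chosen only to show that a single prediction trained with a mean-square-error objective lands on the average of two speaking styles rather than on either one.

```python
def mse(pred, targets):
    """Mean square error of one constant prediction against all targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Hypothetical pitch targets (Hz): the same text spoken in two distinct styles.
targets = [100.0, 100.0, 200.0, 200.0]

# Search a grid of candidate predictions for the one with the lowest MSE.
candidates = [float(c) for c in range(80, 221)]
best = min(candidates, key=lambda c: mse(c, targets))

mean = sum(targets) / len(targets)
print(best, mean)  # both are the dataset average, which matches neither style
```

The lowest-error prediction sits between the two modes, which is the averaged output that an adversarial objective is meant to discourage.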
### Formula Summary

- **Generative Loss**: \[ L_{Ga}=\text{MAE}(\hat{y}_a, y_a) \] where \(\hat{y}_a\) is the predicted acoustic feature, \(y_a\) is the real acoustic feature, and MAE is the mean absolute error.
- **Adversarial Loss**: \[ L_{Aa}=-D_a(\hat{y}_a|x_t, x_s) \] where \(D_a\) is the discriminator for acoustic features, \(x_t\) is the input text information, and \(x_s\) is the speaker representation.
- **Discriminator Loss**: \[ L_{Da}=-\min(0, D_a(y_a|x_t, x_s)-1)-\min(0, -D_a(\hat{y}_a|x_t, x_s)-1) \]
- **Prosodic Feature Generative Loss**: \[ L_{Gp}=\text{MSE}(\hat{y}_p, y_p) \] where \(\hat{y}_p\) is the predicted prosodic feature, \(y_p\) is the real prosodic feature, and MSE is the mean square error.
- **Prosodic Feature Adversarial and Discriminator Losses**: \[ L_{Ap}=-D_p(\hat{y}_p|x_t, x_s) \] \[ L_{Dp}=-\min(0, D_p(y_p|x_t, x_s)-1)-\min(0, -D_p(\hat{y}_p|x_t, x_s)-1) \]
- **Total Optimization Loss**: \[ L_{GA}=L_{Ga}\ \dots \]
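The adversarial and discriminator losses above are hinge-style losses, and a short sketch may make their behavior easier to see. This is a minimal pure-Python illustration, not the paper's implementation: the function names are hypothetical, the discriminator scores are made-up numbers standing in for \(D(\cdot|x_t, x_s)\), and averaging over a batch is an assumption the formulas do not state.

```python
def generator_adv_loss(d_fake):
    """L_A = -D(y_hat | x_t, x_s), averaged over the batch (assumed)."""
    return -sum(d_fake) / len(d_fake)


def discriminator_hinge_loss(d_real, d_fake):
    """L_D = -min(0, D(y) - 1) - min(0, -D(y_hat) - 1), batch-averaged (assumed)."""
    real_term = sum(-min(0.0, s - 1.0) for s in d_real) / len(d_real)
    fake_term = sum(-min(0.0, -s - 1.0) for s in d_fake) / len(d_fake)
    return real_term + fake_term


# Made-up discriminator scores for two real and two generated feature sequences.
d_real = [2.0, 0.5]   # 2.0 is past the +1 margin, 0.5 is not
d_fake = [-2.0, 0.0]  # -2.0 is past the -1 margin, 0.0 is not

print(discriminator_hinge_loss(d_real, d_fake))
print(generator_adv_loss(d_fake))
```

Real scores above +1 and generated scores below -1 contribute nothing to the discriminator loss, so only examples inside the margin drive its updates. The generator term simply pushes the scores of generated features upward.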

Multi-modal Adversarial Training for Zero-Shot Voice Cloning
