Abstract: Voice cloning in text-to-speech (TTS) is the process of replicating the voice of a target speaker with limited data. Among various voice cloning techniques, this paper focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains a challenging task. The key aspect of zero-shot voice cloning is obtaining a speaker embedding from the target speaker. Previous works have used a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this paper proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multi-reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the multi-reference audios are selected based on sentence similarity. The proposed model applies mutual information minimization to the two embeddings to remove entangled information within each embedding. Experiments on the public English dataset VCTK show that our method can improve synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared to state-of-the-art methods that learn reference embeddings, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method is better at maintaining the speaker embedding across different languages. Sample outputs are available on the demo page.

What problem does this paper attempt to address?

This paper attempts to solve two main challenges in zero-shot voice cloning:

1. **Extracting sufficient speaker information**: In zero-shot voice cloning, the model must extract a sufficiently rich speaker embedding from a small amount of reference audio of the target speaker so that the synthesized voice resembles the target. Existing methods usually use a single reference audio, which leads to insufficient speaker information.
2. **Avoiding content information leakage into speaker embeddings**: If content information (such as semantics or emotion) is mixed into the speaker embedding during extraction, the quality and naturalness of the synthesized voice suffer. Effectively separating speaker information from content information is therefore a key issue.

To solve these problems, the paper proposes a new framework, MRMI-TTS (Multi-reference audios and Mutual Information Driven Zero-shot Voice Cloning). The framework addresses the above challenges in the following ways:

- **Multi-reference audio selection**: To obtain richer speaker information, the paper selects multiple reference audios whose content differs substantially, thereby covering more speaker characteristics.
- **Speaker and Content Decoupling Module (SCDM)**: A speaker encoder and a content encoder extract speaker embeddings and content embeddings respectively, and the dependence between the two is reduced through mutual information (MI) minimization, which prevents content information from leaking into the speaker embeddings.

Through these improvements, MRMI-TTS can generate higher-quality and more natural synthesized voices in zero-shot voice cloning tasks, and it performs well even for unseen speakers.

### Formula summary

1. **Normalization formula**:
   \[ \hat{h}=\frac{h - \mu}{\sigma} \]
   where
   \[ \mu=\frac{1}{T}\sum_{t = 1}^{T}h_t,\quad\sigma=\sqrt{\frac{1}{T}\sum_{t = 1}^{T}(h_t-\mu)^2} \]
2. **Conditional layer normalization formula**:
   \[ y = \gamma\cdot\hat{h}+\beta \]
   where \(\gamma\) and \(\beta\) are the gain and bias predicted from the speaker embedding.
3. **VQ loss function**:
   \[ L_{VQ}=\|z_e(x)-e\|_2^2+\|z_q(x)-e\|_2^2 \]
   where \(z_e(x)\) represents the encoded continuous hidden state, and \(e\) represents the discrete hidden state.
4. **InfoNCE loss function**:
   \[ L_{InfoNCE}=-\mathbb{E}\left[\log\frac{\exp(f(x_i^+,x_i))}{\sum_{x_j\in N(x_i)}\exp(f(x_j,x_i))}\right] \]
5. **Upper bound estimation of mutual information** between the speaker embedding \(s\) and the content embedding \(c\):
   \[ I_{vCLUB}(s,c)=\frac{1}{B^2}\sum_{i = 1}^{B}\sum_{j = 1}^{B}\sum_{k = 1}^{B}[\log p_\theta(s_j,c_k|s_i)-\log p_\theta(s_j,c_k|s_k)] \]
6. **Total loss function**:
   \[ L_{total}=L_{FS2}+L_{SCDM}+L_{adv} \]
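
To make formulas 1, 2, and 4 concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `normalize`, `conditional_layer_norm`, `info_nce`, `gain_proj`, and `bias_proj` are hypothetical, the gain and bias are assumed to come from simple linear projections of the speaker embedding, a small `eps` is added for numerical stability, and the InfoNCE candidate set \(N(x_i)\) is assumed to include the positive sample.

```python
import math


def normalize(h, eps=1e-5):
    """Formula 1: subtract the mean and divide by the standard deviation
    computed over the T entries of h."""
    T = len(h)
    mu = sum(h) / T
    var = sum((x - mu) ** 2 for x in h) / T
    sigma = math.sqrt(var + eps)  # eps is an assumption, not in the formula
    return [(x - mu) / sigma for x in h]


def linear(weights, bias, v):
    """Hypothetical scalar-output linear projection: w . v + b."""
    return sum(w * x for w, x in zip(weights, v)) + bias


def conditional_layer_norm(h, speaker_embedding, gain_proj, bias_proj):
    """Formula 2: y = gamma * h_hat + beta, where gamma and beta are
    predicted from the speaker embedding (here by assumed linear maps,
    each given as a (weights, bias) pair)."""
    gamma = linear(gain_proj[0], gain_proj[1], speaker_embedding)
    beta = linear(bias_proj[0], bias_proj[1], speaker_embedding)
    return [gamma * x + beta for x in normalize(h)]


def info_nce(score_pos, scores_all):
    """Formula 4 for a single anchor: -log(exp(f+) / sum_j exp(f_j)).
    scores_all is assumed to contain the positive score as well.
    The maximum is subtracted before exponentiating to avoid overflow."""
    m = max(scores_all)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores_all))
    return -(score_pos - log_denominator)


# Usage: a length-3 hidden sequence conditioned on a 1-dimensional embedding.
y = conditional_layer_norm(
    h=[1.0, 2.0, 3.0],
    speaker_embedding=[1.0],
    gain_proj=([2.0], 0.0),  # gamma = 2.0
    bias_proj=([1.0], 0.0),  # beta = 1.0
)
print(y)

# One positive and one equally scored negative: loss equals log(2).
print(info_nce(0.0, [0.0, 0.0]))
```

The sketch normalizes over the index \(t\) exactly as the summary's formula is written; an actual FastSpeech2-style model would typically operate on batched tensors and learn the projections jointly with the rest of the network.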

MRMI-TTS: Multi-reference audios and Mutual Information Driven Zero-shot Voice cloning
