MRMI-TTS: Multi-reference audios and Mutual Information Driven Zero-shot Voice cloning

Yiting Chen, Wanting Li, Buzhou Tang
DOI: https://doi.org/10.1145/3649501
IF: 1.471
2024-03-30
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: Voice cloning in text-to-speech (TTS) is the process of replicating the voice of a target speaker with limited data. Among various voice cloning techniques, this paper focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains a challenging task. The key aspect of zero-shot voice cloning is to obtain a speaker embedding from the target speaker. Previous works have used a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this paper proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multi-reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the multi-reference audios are selected based on sentence similarity. The proposed model applies mutual information minimization to the two embeddings to remove entangled information within each embedding. Experiments on the public English dataset VCTK show that our method can improve synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared to state-of-the-art methods that learn reference embeddings, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method has a better capability of maintaining the speaker embedding in different languages. Sample outputs are available on the demo page.
computer science, artificial intelligence
What problem does this paper attempt to address?
The problems that this paper attempts to solve are two main challenges in zero-shot voice cloning:

1. **Extract sufficient speaker information**: In zero-shot voice cloning, the model needs to extract rich enough speaker embeddings from a small amount of reference audio of the target speaker to ensure the similarity of the synthesized voice. However, existing methods usually use a single reference audio, which leads to insufficient speaker information.
2. **Avoid content information leakage into speaker embeddings**: When extracting speaker embeddings, if content information (such as semantics, emotions, etc.) is mixed in, it will affect the quality and naturalness of the synthesized voice. Therefore, how to effectively separate speaker information and content information is a key issue.

To solve these problems, the paper proposes a new framework, MRMI-TTS (Multi-reference audios and Mutual Information Driven Zero-shot Voice Cloning). This framework addresses the above challenges in the following ways:

- **Multi-reference audio selection**: In order to obtain more abundant speaker information, the paper selects multiple reference audios with large content differences, thus covering more speaker characteristics.
- **Speaker and Content Decoupling Module (SCDM)**: A speaker encoder and a content encoder are introduced to extract speaker embeddings and content embeddings respectively, and the dependence between the two is reduced through the Mutual Information Minimization (MI minimization) technique, thereby avoiding content information leakage into speaker embeddings.

Through these improvements, MRMI-TTS can generate higher-quality and more natural synthesized voices in zero-shot voice cloning tasks, and can perform well even for unseen speakers.

### Formula summary

1. **Normalization formula**: \[ \hat{h}=\frac{h - \mu}{\sigma} \] where \[ \mu=\frac{1}{T}\sum_{t = 1}^{T}h_t,\quad\sigma=\sqrt{\frac{1}{T}\sum_{t = 1}^{T}(h_t-\mu)^2} \]
2. **Conditional layer normalization formula**: \[ y = \gamma\cdot\hat{h}+\beta \] where \(\gamma\) and \(\beta\) are the gain and bias predicted by the speaker embedding.
3. **VQ loss function**: \[ L_{VQ}=\|z_e(x)-e\|_2^2+\|z_q(x)-e\|_2^2 \] where \(z_e(x)\) represents the encoded continuous hidden state, and \(e\) represents the discrete hidden state.
4. **InfoNCE loss function**: \[ L_{InfoNCE}=-\mathbb{E}\left[\log\frac{\exp(f(x_i^+,x_i))}{\sum_{x_j\in N(x_i)}\exp(f(x_j,x_i))}\right] \]
5. **Upper bound estimation of mutual information**: \[ I_{vCLUB}(s,c)=\frac{1}{B^2}\sum_{i = 1}^{B}\sum_{j = 1}^{B}\sum_{k = 1}^{B}[\log p_\theta(s_j,c_k|s_i)-\log p_\theta(s_j,c_k|s_k)] \]
6. **Total loss function**: \[ L_{total}=L_{FS2}+L_{SCDM}+L_{adv}
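
As a rough illustration of the normalization, conditional layer normalization, and InfoNCE formulas in the summary above, the following is a minimal pure-Python sketch, not the authors' implementation. Several details are assumptions made only for this example: the statistics \(\mu\) and \(\sigma\) are computed per channel over the \(T\) frames, a small `eps` is added under the square root for numerical stability, \(\gamma\) and \(\beta\) are predicted from the speaker embedding by plain linear layers (the hypothetical `w_gamma`, `b_gamma`, `w_beta`, `b_beta`), the score function \(f\) is a dot product, and the candidate set \(N(x_i)\) is taken to contain the positive sample.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(weight, bias, x):
    """Affine map weight @ x + bias, with weight given as a list of rows."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def normalize_over_time(h, eps=1e-5):
    """h_hat = (h - mu) / sigma, with mu and sigma computed per channel
    over the T frames of h (h is a list of T frames, each of D floats).
    eps is an assumption added for numerical stability."""
    T, D = len(h), len(h[0])
    mu = [sum(frame[d] for frame in h) / T for d in range(D)]
    sigma = [
        math.sqrt(sum((frame[d] - mu[d]) ** 2 for frame in h) / T + eps)
        for d in range(D)
    ]
    return [[(frame[d] - mu[d]) / sigma[d] for d in range(D)] for frame in h]


def conditional_layer_norm(h, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """y = gamma * h_hat + beta, where gamma and beta are predicted from
    the speaker embedding by (hypothetical) linear layers."""
    gamma = linear(w_gamma, b_gamma, speaker_emb)
    beta = linear(w_beta, b_beta, speaker_emb)
    h_hat = normalize_over_time(h)
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in h_hat]


def info_nce(anchor, positive, negatives):
    """InfoNCE loss for one anchor, using a dot-product score (assumption).
    The denominator runs over the positive plus all negatives."""
    pos_score = dot(positive, anchor)
    scores = [pos_score] + [dot(n, anchor) for n in negatives]
    # Log-sum-exp with max subtraction to avoid overflow.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(pos_score - log_denominator)


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]  # T=3 frames, D=2 channels
    speaker = [0.5, -0.5]                            # toy speaker embedding
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # Zero weights with bias 1 / 0 give gamma = 1 and beta = 0 for any speaker.
    print(conditional_layer_norm(hidden, speaker, zeros, [1.0, 1.0], zeros, [0.0, 0.0]))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this sketch the speaker embedding influences the hidden sequence only through the predicted gain and bias, which is the sense in which it acts as a conditioning variable. The InfoNCE value shrinks as the positive sample scores higher against the anchor than the negatives do.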