MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo
2024-10-04
Abstract: Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach combining prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-task TTS performance of MultiVerse and show that MultiVerse not only achieves zero-shot TTS performance comparable to data-driven TTS systems with much less data, but also significantly outperforms other zero-shot TTS systems trained with the same small amount of data. In particular, our novel prosody modeling technique significantly contributes to MultiVerse's ability to generate speech with high prosody similarity to the given prompts. Our samples are available at <a class="link-external link-https" href="https://nc-ai.github.io/speech/publications/multiverse/index.html" rel="external noopener nofollow">https://nc-ai.github.io/speech/publications/multiverse/index.html</a>.
Audio and Speech Processing, Artificial Intelligence, Sound
What problem does this paper attempt to address?
The main problem this paper addresses is how to achieve high-quality zero-shot multi-task text-to-speech (TTS) synthesis and cross-lingual speech style transfer while reducing the amount of training data. Specifically:

1. **Reducing the need for training data**: Large-scale data-driven methods need a large amount of training data to generalize well, which increases costs and makes it harder to support low-resource languages. The proposed method aims to achieve comparable results with less data.
2. **Improving prosody similarity**: Existing zero-shot TTS systems often overlook prosody similarity, so the generated speech does not match the prosody of the prompt speech. By combining prompt-based autoregressive and non-autoregressive prosody modeling, the paper significantly improves prosody similarity.
3. **Supporting multiple tasks**: MultiVerse can perform several tasks under zero-shot conditions, including zero-shot TTS, cross-lingual TTS, and speech style transfer. These tasks can be carried out individually or in combination, for example zero-shot cross-lingual speech style transfer.

### Specific solutions

- **Disentangled modeling based on source-filter theory**: Speech generation is decomposed into filter-related and source-related representations, and both are modeled from the prompt speech. This reduces the dependence on a large amount of data.
- **Enhanced prosody modeling**: A two-stage prosody modeling method is adopted. First, an autoregressive model predicts prosody-related acoustic features; then a non-autoregressive method further refines the prosody in the latent space.
- **Multi-task learning**: By adjusting its input conditions, MultiVerse can switch between tasks such as zero-shot TTS, cross-lingual TTS, and speech style transfer.

### Experimental results

The experimental results show that MultiVerse generates high-quality speech under zero-shot and cross-lingual conditions with only a small amount of training data, and that it significantly outperforms other zero-shot TTS systems in prosody similarity. Although it uses far less training data than large-scale data-driven models, MultiVerse still performs well in naturalness and similarity.

In summary, by combining source-filter-based disentanglement with two-stage prosody modeling, the paper addresses the large data requirements and low prosody similarity of existing zero-shot TTS systems.
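To give some intuition for the source-filter idea mentioned above, the sketch below uses classical cepstral liftering to split the log-magnitude spectrum of one frame into a smooth envelope (filter-like) and a fine-structure residual (source-like). This is a textbook signal-processing illustration, not MultiVerse's learned disentanglement; the function name `source_filter_split`, the lifter length, and the synthetic test frame are all assumptions made for this example.

```python
import numpy as np


def source_filter_split(frame, n_lifter=30):
    """Split a frame's log-magnitude spectrum into envelope and excitation.

    Illustrative only: low quefrencies approximate the filter (spectral
    envelope), the remainder approximates the source (harmonic structure).
    """
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-8)

    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_mag)

    # Keep only the low-quefrency part (and its mirror image).
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0

    envelope = np.fft.rfft(cepstrum * lifter).real  # filter-like part
    excitation = log_mag - envelope                 # source-like part
    return envelope, excitation


# Synthetic voiced-like frame: an impulse train (source) passed through
# a two-pole resonator (a stand-in for a vocal-tract filter).
sr, n = 16000, 1024
excite = np.zeros(n)
excite[::100] = 1.0  # 160 Hz pulse train at 16 kHz

r, f = 0.97, 700.0
a1, a2 = -2.0 * r * np.cos(2.0 * np.pi * f / sr), r * r
frame = np.zeros(n)
for t in range(n):
    y1 = frame[t - 1] if t >= 1 else 0.0
    y2 = frame[t - 2] if t >= 2 else 0.0
    frame[t] = excite[t] - a1 * y1 - a2 * y2

envelope, excitation = source_filter_split(frame)
```

The two parts add back to the original log spectrum by construction. The envelope varies slowly across frequency, while the excitation carries the harmonic ripple associated with pitch.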