Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, Jingjing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou
2024-09-24
Abstract: With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and allowing individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we adopt an effective joint content and timbre modeling approach to improve speaker similarity, together with a conditional flow matching-based decoder to further enhance naturalness and expressiveness. Last, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to https://everest-ai.github.io/takinaudiollm/.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Enhancement of Zero-Shot Voice Generation Technology**: With the development of big data and large language models, zero-shot personalized rapid customization has become an important trend. The paper proposes the Takin AudioLLM series of techniques and models (including Takin TTS, Takin VC, and Takin Morphing), specifically designed for audiobook production. These models are capable of zero-shot voice generation, producing high-quality voices that are almost indistinguishable from real human voices, and allow users to customize voice content according to their own needs.
2. **Improving the Naturalness and Expressiveness of Speech Synthesis**:
   - **Takin TTS**: Proposes a neural codec language model built on an enhanced neural speech codec and a multi-task training framework, capable of zero-shot high-quality voice generation.
   - **Takin VC**: Uses a joint modeling method for content and timbre to improve speaker similarity, and employs a conditional flow matching decoder to further enhance naturalness and expressiveness.
   - **Takin Morphing**: Introduces highly decoupled timbre and prosody modeling methods, enabling users to customize voice generation precisely and controllably.
3. **Meeting Diverse Application Needs**: The Takin AudioLLM series models advance speech synthesis technology and also address the growing demand for personalized audiobook production, allowing users to customize voice generation for application scenarios ranging from entertainment to business.

In summary, this paper is dedicated to enhancing the quality and flexibility of speech synthesis through technological innovation to better serve practical applications.
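The conditional flow matching decoder mentioned for Takin VC can be illustrated with a small sketch of the generic technique. This is not the authors' implementation: it assumes the commonly used straight-line (optimal-transport) probability path, and the names `cfm_training_pair`, `cfm_loss`, `euler_sample`, and `SIGMA_MIN` are hypothetical. In a real decoder a neural network conditioned on content and timbre would predict the velocity; here plain Python lists stand in for acoustic feature frames.

```python
# Minimal sketch of conditional flow matching (CFM) on plain Python lists.
# Assumption: straight-line (optimal-transport) path between noise x0 and data x1.
# All names here are illustrative, not taken from the Takin report.

SIGMA_MIN = 1e-4  # assumed small noise floor at t = 1


def cfm_training_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) for noise x0, data x1, and time t in [0, 1]."""
    scale = 1.0 - (1.0 - sigma_min) * t
    # Point on the path: equals x0 at t = 0 and is close to x1 at t = 1.
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # Time derivative of x_t; this is the regression target for the model.
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

During training, a model would be fit to minimize `cfm_loss` over random draws of `x0`, `x1`, and `t`; at inference, `euler_sample` would be called with the trained model as `velocity_fn`, starting from noise.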