Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, Jingjing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou

2024-09-24

Abstract: With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and enabling individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we advocate an effective content and timbre joint modeling approach to improve speaker similarity, together with a conditional flow matching based decoder to further enhance naturalness and expressiveness. Last, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to https://everest-ai.github.io/takinaudiollm/.

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Enhancement of Zero-Shot Voice Generation Technology**: With the development of big data and large language models, zero-shot personalized rapid customization has become an important trend. The paper proposes the Takin AudioLLM series of technologies (including Takin TTS, Takin VC, and Takin Morphing), specifically designed for audiobook production. These models are capable of zero-shot voice generation, producing high-quality voices that are almost indistinguishable from real human voices, and allow users to customize voice content according to their own needs.
2. **Improving the Naturalness and Expressiveness of Speech Synthesis**:
   - **Takin TTS**: Proposes a neural codec language model built on an enhanced neural speech codec and a multi-task training framework, capable of zero-shot high-quality voice generation.
   - **Takin VC**: Advocates an effective joint modeling method of content and timbre to improve speaker similarity, and employs a conditional flow matching decoder to further enhance naturalness and expressiveness.
   - **Takin Morphing**: Introduces highly decoupled and advanced timbre and prosody modeling methods, enabling users to precisely and controllably customize voice generation.
3. **Meeting Diverse Application Needs**: The Takin AudioLLM series models advance speech synthesis technology and also address the growing demand for personalized audiobook production, allowing users to precisely customize voice generation across application scenarios ranging from entertainment to business.

In summary, this paper is dedicated to enhancing the quality and flexibility of speech synthesis through technological innovation to better serve practical applications.
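To make the "conditional flow matching decoder" mentioned for Takin VC more concrete, the sketch below shows how a training example is typically built in the generic optimal-transport formulation of conditional flow matching. This is not the Takin implementation: the function names `cfm_training_pair` and `cfm_loss`, the `sigma_min` value, and the use of mel-spectrogram-like features are all assumptions made for illustration, and the conditioning on content and timbre that a real decoder would need is omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one conditional flow matching training example (generic sketch).

    x1: target features, e.g. mel-spectrogram frames of shape (T, D) (assumed).
    Returns (t, x_t, u_t): a time in [0, 1), a point on the straight path
    from noise to x1, and the target velocity at that point.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample at t = 0
    t = rng.uniform()                   # random time in [0, 1)
    # Optimal-transport conditional path and its time derivative.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))
```

In training, a neural network would predict the velocity from `x_t`, `t`, and the conditioning signals, and `cfm_loss` would be minimized; at inference, an ODE solver integrates the predicted velocity from noise to speech features.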

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens