Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang

2024-06-04

Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

This paper introduces Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models that aims to generate high-quality, versatile speech that is virtually indistinguishable from human speech. As a foundation model for speech generation, Seed-TTS performs well at speech in-context learning, reaching speaker similarity and naturalness that match ground-truth human speech in both objective and subjective evaluations. Fine-tuning raises the subjective scores on these metrics further. The model offers control over speech attributes such as emotion and can generate expressive, diverse speech for speakers in the wild. Additionally, the paper proposes a self-distillation method for speech factorization and a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. A non-autoregressive (NAR) variant, called Seed-TTS DiT, is also introduced; it uses a fully diffusion-based architecture and, unlike previous NAR-based TTS systems, does not rely on pre-estimated phoneme durations, generating speech end to end. The paper shows that this variant performs comparably to the language model-based variant and is effective in speech editing tasks. The paper also discusses the potential applications, limitations, and challenges encountered during the development of Seed-TTS, highlighting the multimedia and safety issues that must be considered when building responsible artificial intelligence.

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

SR-TTS: a rhyme-based end-to-end speech synthesis system

DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion