Abstract: This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and can be generated by a language model from the prompt text and audio. Then, a two-stage waveform generator is proposed to decode them to the high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.

What problem does this paper attempt to address?

This paper addresses the growing demand for personalized and diverse speech generation in text-to-speech (TTS) systems. As AI products such as virtual assistants, chatbots, and video dubbing tools become widespread, users expect TTS systems to produce speech tailored to individual voices and varied speaking styles. To meet this need, the paper proposes FireRedTTS, a language-model-based foundation TTS framework intended to support industry-level generative speech applications. The specific problems it targets are:

1. **Personalized and diverse speech generation**: Traditional TTS systems usually generate speech in a fixed style and lack flexibility. FireRedTTS combines a large-scale, high-quality dataset covering a wide range of content, speaking styles, and timbres with a language-model-based foundation system, so that it can generate speech in different styles for different scenarios and user needs.
2. **High-quality speech synthesis**: To keep the generated speech high-fidelity and natural, FireRedTTS uses a two-stage waveform generator that decodes discrete semantic tokens into a high-fidelity audio waveform.
3. **Zero-shot and few-shot adaptation**: In user-generated content (UGC) and professional user-generated content (PUGC) scenarios, the system must adapt to new speakers or voice styles from limited data. FireRedTTS clones target voices zero-shot for the UGC scenario and adapts to studio-level expressive voice characters in the PUGC scenario through few-shot fine-tuning on a 1-hour recording.
4. **Controllable human-like conversational speech**: To make applications such as chatbots more interactive and natural, FireRedTTS uses instruction tuning so that the generated speech can carry emotions and paralinguistic behaviors in a casual style, better approximating human conversation.

In summary, the paper aims to address the shortcomings of existing TTS systems in personalization, diversity, and speech quality by building a foundation TTS framework, while also improving adaptability and controllability in practical applications.
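
The data flow described above (speech tokenizer, then language model, then two-stage waveform generator) can be sketched as a toy pipeline. This is only an illustration of how the stages connect, not the authors' implementation: every function name, the frame and hop sizes, the vocabulary size, and the split of the waveform generator into "tokens to intermediate features" and "features to samples" are assumptions made for the example, and each stage is replaced by trivial arithmetic rather than a neural network.

```python
from dataclasses import dataclass
from typing import List

# NOTE: all names and constants here are hypothetical placeholders,
# not part of any FireRedTTS API.
VOCAB = 16  # assumed size of the discrete semantic-token vocabulary


@dataclass
class Prompt:
    text: str           # transcript of the prompt audio
    audio: List[float]  # prompt waveform samples in [-1, 1]


def tokenize_speech(audio: List[float], frame: int = 4) -> List[int]:
    """Toy stand-in for the speech tokenizer: one discrete token per frame."""
    tokens = []
    for start in range(0, len(audio) - frame + 1, frame):
        mean = sum(audio[start:start + frame]) / frame
        idx = int((mean + 1.0) / 2.0 * (VOCAB - 1))  # quantize the frame mean
        tokens.append(max(0, min(VOCAB - 1, idx)))
    return tokens


def language_model_generate(prompt_tokens: List[int], prompt_text: str,
                            target_text: str) -> List[int]:
    """Toy stand-in for the language model: one token per target character,
    deterministically conditioned on the prompt text and prompt tokens."""
    context = sum(prompt_tokens) + len(prompt_text)
    return [(ord(ch) + context) % VOCAB for ch in target_text]


def stage_one(tokens: List[int]) -> List[float]:
    """Assumed first decoder stage: tokens -> intermediate features in [-1, 1]."""
    return [t / (VOCAB - 1) * 2.0 - 1.0 for t in tokens]


def stage_two(features: List[float], hop: int = 4) -> List[float]:
    """Assumed second decoder stage: features -> waveform (sample-and-hold)."""
    return [f for f in features for _ in range(hop)]


def synthesize(prompt: Prompt, target_text: str) -> List[float]:
    """Chain the three components in the order the summary describes."""
    prompt_tokens = tokenize_speech(prompt.audio)
    speech_tokens = language_model_generate(prompt_tokens, prompt.text, target_text)
    return stage_two(stage_one(speech_tokens))
```

Calling `synthesize(Prompt(text="hi", audio=[0.0] * 8), "hello")` returns 20 samples (five toy tokens, each held for a hop of four). The point of the sketch is the interface between stages: only discrete tokens pass from the language model to the waveform generator, which is what lets the prompt text and audio steer generation.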

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
