Abstract: This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demand for personalized and diverse generative speech applications. The framework comprises three parts: data processing, a foundation system, and downstream applications. First, we present our data processing pipeline, which transforms massive raw audio into a large-scale, high-quality TTS dataset with rich annotations and wide coverage of content, speaking style, and timbre. Second, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens by a semantic-aware speech tokenizer, and a language model generates these tokens from the prompt text and audio. A two-stage waveform generator then decodes the tokens into a high-fidelity waveform. Third, we present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which stably synthesizes high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot manner for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning on one hour of recordings. Moreover, through instruction tuning, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions, to better serve spoken chatbots.

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

Enhancing audio quality for expressive Neural Text-to-Speech

MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages

Text-To-Speech Synthesis In The Wild

TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS

Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning

Building African Voices

An overview of text-to-speech systems and media applications

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset

A Survey on Neural Speech Synthesis