FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

Hao-Han Guo, Kun Liu, Fei-Yu Shen, Yi-Chen Wu, Feng-Long Xie, Kun Xie, Kai-Tuo Xu
2024-09-05
Abstract: This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and can be generated by a language model from the prompt text and audio. Then, a two-stage waveform generator is proposed to decode them to the high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the growing demand for personalized and diverse speech generation in text-to-speech (TTS) technology. As AI products such as virtual assistants, chatbots, and video dubbing tools become widespread, users expect TTS systems to produce more personalized and varied speech. To meet this need, the paper proposes FireRedTTS, a language-model-based foundation TTS framework intended to support industry-level generative speech applications. Specifically, the paper targets the following problems:

1. **Personalized and diverse speech generation**: Traditional TTS systems usually generate speech in a fixed style and lack flexibility. FireRedTTS combines a large-scale, high-quality dataset with a language-model-based system so that it can generate speech in different styles for different scenarios and user needs.
2. **High-quality speech synthesis**: To keep the generated speech high-fidelity and natural, FireRedTTS uses a two-stage waveform generator that decodes discrete semantic tokens into a high-fidelity waveform.
3. **Zero-shot and few-shot adaptation**: In user-generated content (UGC) and professional user-generated content (PUGC) scenarios, the system must adapt to new speakers or speaking styles with limited data. FireRedTTS clones target voices in a zero-shot way for UGC and adapts to studio-level expressive voices for PUGC through few-shot fine-tuning on about 1 hour of recordings.
4. **Controllable human-like conversational speech**: To make applications such as chatbots more natural and interactive, FireRedTTS uses instruction tuning so that the generated speech can carry emotions and paralinguistic behaviors, better approximating human conversation.

In summary, the paper aims to address the shortcomings of existing TTS systems in personalization, diversity, and speech quality by building a foundation TTS framework, while also improving adaptability and controllability in practical applications.
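
The data flow described in the abstract (speech tokenizer, language model conditioned on prompt text and audio, then a two-stage waveform generator) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the `Prompt` type, the vocabulary size, and the arithmetic inside each stage are hypothetical stand-ins, not FireRedTTS code or its API. The source also does not say what the intermediate representation between the two waveform-generator stages is, so the sketch uses a generic feature list.

```python
from dataclasses import dataclass
from typing import List

# NOTE: every name and formula below is a hypothetical stand-in used only to
# show how the stages connect; none of it is taken from FireRedTTS itself.

VOCAB = 16  # assumed size of the discrete semantic-token vocabulary


@dataclass
class Prompt:
    text: str           # transcript of the prompt audio
    audio: List[float]  # prompt waveform samples in [-1, 1]


def tokenize_speech(audio: List[float], frame: int = 4) -> List[int]:
    """Toy speech tokenizer: compress each frame of samples to one token id."""
    tokens = []
    for start in range(0, len(audio), frame):
        chunk = audio[start:start + frame]
        mean = sum(chunk) / len(chunk)
        # Map the mean amplitude in [-1, 1] to an id in [0, VOCAB - 1].
        token = int((mean + 1.0) / 2.0 * VOCAB)
        tokens.append(min(VOCAB - 1, max(0, token)))
    return tokens


def generate_tokens(prompt: Prompt, target_text: str) -> List[int]:
    """Toy language model: emit tokens for target_text, conditioned on the
    prompt text and the prompt audio's tokens (here reduced to an offset)."""
    context = tokenize_speech(prompt.audio)
    offset = (sum(context) + len(prompt.text)) % VOCAB
    return [(ord(ch) + offset) % VOCAB for ch in target_text]


def tokens_to_features(tokens: List[int]) -> List[float]:
    """Stage 1 (toy): discrete tokens -> continuous intermediate features."""
    return [t / (VOCAB - 1) * 2.0 - 1.0 for t in tokens]


def features_to_waveform(features: List[float], hop: int = 4) -> List[float]:
    """Stage 2 (toy): upsample the features to waveform samples."""
    return [f for f in features for _ in range(hop)]


def synthesize(prompt: Prompt, target_text: str) -> List[float]:
    """Chain the stages: prompt + text -> tokens -> features -> waveform."""
    tokens = generate_tokens(prompt, target_text)
    return features_to_waveform(tokens_to_features(tokens))


if __name__ == "__main__":
    prompt = Prompt(text="hi", audio=[0.0] * 8)
    waveform = synthesize(prompt, "hello")
    print(len(waveform), "samples")
```

The point of the sketch is only the interface between stages: the language model never produces audio directly, it produces discrete tokens, and the waveform generator is a separate decoder that could in principle be swapped or fine-tuned independently.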