PromptTTS 2: Describing and Generating Voices with Text Prompt

Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
2023-10-12
Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech. In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. For the prompt generation pipeline, it generates text prompts for speech with a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online.
Audio and Speech Processing, Computation and Language, Machine Learning, Sound
What problem does this paper attempt to address?
The paper addresses voice variability modeling in text-to-speech (TTS) systems, specifically how to use text prompts to describe and generate diverse speech. It focuses on two main challenges:

1. **One-to-Many Problem**: The same text prompt may correspond to many different speech samples, because a text prompt cannot express every detail of how a voice varies. This makes model training harder and can lead to overfitting or mode collapse.
2. **Data Scale Challenge**: Datasets of text prompts that describe voice characteristics are hard to build, because such prompts are rarely found online. They typically have to be written by specialized personnel, which is both time-consuming and costly.

To address these issues, the paper proposes the PromptTTS 2 system, which includes the following key components:

- **Variation Network**: Predicts the voice variability information that the text prompt does not capture. During training, it learns to predict the representation extracted from the reference speech, conditioned on the text prompt representation. During inference, the characteristics of the synthesized speech can be varied by sampling from Gaussian noise, which gives the user more flexibility.
- **Text Prompt Generation Pipeline**: Automatically generates high-quality text prompts for speech. The pipeline first uses a Speech Language Understanding (SLU) model to recognize attributes of the speech (such as gender and speed), and then uses a Large Language Model (LLM) to write text prompts based on the recognized attributes. This removes the need to write text prompts manually and reduces cost.
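
The two-stage prompt generation pipeline described above can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: `build_instruction`, `generate_text_prompt`, `fake_slu`, and `fake_llm` are hypothetical names, the instruction wording is invented, and both models are replaced by stubs so that the example runs on its own.

```python
from typing import Callable, Dict

# Recognized voice attributes, e.g. {"gender": "female", "speed": "fast"}.
Attributes = Dict[str, str]


def build_instruction(attributes: Attributes) -> str:
    """Turn recognized attributes into an instruction for the LLM (hypothetical wording)."""
    # Sort keys so the instruction is deterministic for a given attribute set.
    described = "; ".join(f"{name}: {value}" for name, value in sorted(attributes.items()))
    return "Write one sentence describing a voice with these attributes: " + described


def generate_text_prompt(
    speech: object,
    slu_model: Callable[[object], Attributes],
    llm: Callable[[str], str],
) -> str:
    """Step 1: recognize attributes from speech. Step 2: ask the LLM to phrase them."""
    attributes = slu_model(speech)
    return llm(build_instruction(attributes))


# Stubs so the sketch runs without any real model; both are placeholders.
def fake_slu(speech: object) -> Attributes:
    return {"gender": "female", "speed": "fast"}


def fake_llm(instruction: str) -> str:
    return "[LLM response to] " + instruction


prompt = generate_text_prompt(speech=None, slu_model=fake_slu, llm=fake_llm)
print(prompt)
```

Passing the SLU model and the LLM in as plain callables keeps the sketch independent of any particular model or API; a real system would substitute trained models for the two stubs.
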
Experimental results show that compared to existing methods, PromptTTS 2 can generate speech that is more consistent with the text prompts and supports controlling diverse speech variations by sampling from Gaussian noise, providing users with more options. Additionally, the text prompt generation pipeline can produce high-quality text prompts, avoiding the high cost of manual annotation.
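
To make the idea of sampling from Gaussian noise concrete, here is a toy sketch of how one text prompt could yield several different voice variation representations. It is a deliberately simplified stand-in under stated assumptions: `sample_variation` is a hypothetical function that merely adds scaled Gaussian noise to a prompt representation, whereas the paper's variation network is a trained model; the vector values and the `scale` parameter are invented for illustration.

```python
import random
from typing import List


def sample_variation(prompt_repr: List[float], seed: int, scale: float = 1.0) -> List[float]:
    """Toy stand-in for a variation network: prompt representation plus Gaussian noise.

    Assumption: a trained model would map (noise, prompt) to a variation
    representation; here the mapping is simple addition so the sketch runs.
    """
    rng = random.Random(seed)  # a fixed seed makes each sample reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in prompt_repr]
    return [p + scale * n for p, n in zip(prompt_repr, noise)]


# One (made-up) prompt representation, two different noise draws.
prompt_repr = [0.2, -0.5, 1.0]
voice_a = sample_variation(prompt_repr, seed=0)
voice_b = sample_variation(prompt_repr, seed=1)

print(voice_a != voice_b)  # different noise, different variation for the same prompt
```

The point of the sketch is the one-to-many behavior: the prompt fixes what is described, while each noise draw selects one of the many voices consistent with that description.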