Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods that rely on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly, since speech prompts can be hard to find or may not exist at all. TTS approaches based on text prompts face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and incurs a large data labeling cost. In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides voice variability information not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech using a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices for voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online.

What problem does this paper attempt to address?

The paper addresses voice variability modeling in text-to-speech (TTS) systems, in particular how to describe and generate diverse voices with text prompts. It focuses on two main challenges:

1. **One-to-Many Problem**: The same text prompt may correspond to many different speech samples, because the detailed variations in a voice cannot be fully expressed in text. This makes model training harder and can lead to overfitting or mode collapse.
2. **Data Scale Challenge**: Datasets of text prompts that describe voice characteristics are difficult to build, since such prompts are rarely found online. They typically have to be written by specialized personnel, which is both time-consuming and costly.

To address these issues, the paper proposes the PromptTTS 2 system, which has two key components:

- **Variation Network**: Predicts the voice variability information not captured by the text prompt. During training, it learns to predict the representation extracted from the reference speech, conditioned on the text prompt representation. During inference, different voices can be obtained for the same prompt by sampling from Gaussian noise, giving the user more flexibility.
- **Text Prompt Generation Pipeline**: Automatically generates high-quality text prompts for speech. The pipeline first uses a speech language understanding (SLU) model to recognize attributes of the speech (such as gender and speed), and then uses a large language model (LLM) to write text prompts from the recognized attributes. This removes the need to write text prompts by hand and reduces cost.

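
The two stages of the prompt generation pipeline (recognize attributes, then write a prompt from them) can be illustrated with a minimal sketch. Everything here is hypothetical: `recognize_attributes` simply reads given labels where the paper uses an SLU model, and `compose_prompt` fills fixed templates where the paper uses an LLM. It shows only the data flow, not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAttributes:
    """Attributes recognized from speech (illustrative subset)."""
    gender: str
    speed: str


# Hypothetical templates standing in for LLM-written sentences.
TEMPLATES = [
    "A {gender} speaker talking at a {speed} pace.",
    "Please generate a {gender} voice that speaks {speed_adv}.",
]
SPEED_ADVERB = {"slow": "slowly", "normal": "at a normal speed", "fast": "quickly"}


def recognize_attributes(labels: dict) -> VoiceAttributes:
    """Stand-in for the SLU model: attributes are taken from given labels."""
    return VoiceAttributes(gender=labels["gender"], speed=labels["speed"])


def compose_prompt(attrs: VoiceAttributes, rng: random.Random) -> str:
    """Stand-in for the LLM: pick a template and fill in the attributes."""
    template = rng.choice(TEMPLATES)
    return template.format(
        gender=attrs.gender,
        speed=attrs.speed,
        speed_adv=SPEED_ADVERB[attrs.speed],
    )


if __name__ == "__main__":
    attrs = recognize_attributes({"gender": "female", "speed": "fast"})
    print(compose_prompt(attrs, random.Random(0)))
```

Choosing among several phrasings for the same attributes mirrors the motivation for using an LLM: the same recognized attributes should yield varied, natural-sounding prompts.
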
Experimental results show that compared to existing methods, PromptTTS 2 can generate speech that is more consistent with the text prompts and supports controlling diverse speech variations by sampling from Gaussian noise, providing users with more options. Additionally, the text prompt generation pipeline can produce high-quality text prompts, avoiding the high cost of manual annotation.
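
The idea of obtaining different voices for one prompt by sampling Gaussian noise can be shown with a toy sketch. The function `toy_variation_network` is a hypothetical, fixed stand-in for a trained network, and the three-dimensional `prompt_repr` is made up. The sketch demonstrates only that a fixed prompt representation combined with different noise samples yields different variation vectors. It does not reproduce the paper's model.

```python
import math
import random


def toy_variation_network(prompt_repr, noise):
    """Hypothetical stand-in for a trained network.

    Maps a prompt representation and a noise vector of the same
    length to a "variation" vector. A real model would be learned.
    """
    return [math.tanh(p + 0.5 * z) for p, z in zip(prompt_repr, noise)]


def sample_variation(prompt_repr, seed):
    """Draw Gaussian noise with the given seed and map it to a variation vector."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in prompt_repr]
    return toy_variation_network(prompt_repr, noise)


if __name__ == "__main__":
    prompt_repr = [0.2, -0.1, 0.7]  # made-up text prompt representation
    for seed in (0, 1, 2):
        # Same prompt, different noise: a different variation vector each time.
        print(seed, sample_variation(prompt_repr, seed))
```

Fixing the seed makes a sampled voice reproducible, while changing it gives the user another candidate voice for the same description.
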

PromptTTS 2: Describing and Generating Voices with Text Prompt
