ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Yuyue Wang, Huan Xiao, Yihan Wu, Ruihua Song

2023-05-20

Abstract: Text to Speech (TTS) models can generate natural and high-quality speech, but they are not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedy. Because comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, synthesizing such speech requires real-world datasets and strong speech style modeling capabilities, both of which are challenging to obtain. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios. First, we extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way. Second, we enhance personal rhythm modeling with a conditional duration predictor. Third, we model personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data for each comedian. The audio samples are available at https://xh621.github.io/stand-up-comedy-demo/

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the problem of synthesizing stand-up comedy speech with a strong personal style in low-resource scenarios. Although existing text-to-speech (TTS) models can generate natural, high-quality speech, they perform poorly on highly expressive speech such as stand-up comedy. Stand-up comedians usually have distinctive personal speech styles, including individual prosody, rhythm, and filler words, which place higher demands on TTS systems.

The main contributions of the paper include:

1. **Constructing a new dataset**: Because real-world stand-up comedy speech datasets are lacking, the authors constructed a multi-speaker stand-up comedy speech dataset.
2. **Developing the ComedicSpeech system**: The authors propose a TTS system designed specifically for stand-up comedy speech synthesis, which improves the modeling of personal speech style in three ways:
   - **Personal prosody modeling**: A prosody encoder extracts a personal prosody representation, which is integrated into the TTS model through conditional layer normalization.
   - **Personal rhythm modeling**: A conditional duration predictor learns the speaker-dependent phoneme duration distribution, enhancing the modeling of personal rhythm.
   - **Personal filler word modeling**: Special tokens are introduced to represent personal filler words, capturing the individual speaking habits of stand-up comedians.

Experimental results show that ComedicSpeech, using only 10 minutes of training data per speaker, outperforms baseline models in speech quality, speaker similarity, and expressiveness.
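To make two of the ideas above more concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes conditional layer normalization means predicting the per-channel scale and shift from a style (prosody) vector with linear projections, which is the common formulation; the summary does not confirm the exact variant used in ComedicSpeech. The function names, the weight layout, and the `<filler_SPEAKER_WORD>` token format are all hypothetical and chosen only for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Compute weights @ vec + bias, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conditional_layer_norm(x, style, w_scale, b_scale, w_shift, b_shift):
    """Layer norm whose scale and shift are predicted from a style vector.

    Assumption: scale and shift are linear functions of the style embedding.
    """
    scale = linear(w_scale, b_scale, style)
    shift = linear(w_shift, b_shift, style)
    return [g * n + s for g, n, s in zip(scale, layer_norm(x), shift)]


def tag_fillers(words, speaker_id, fillers):
    """Replace filler words with speaker-specific special tokens.

    The token format is hypothetical; the paper's actual scheme is not
    described in this summary.
    """
    return [f"<filler_{speaker_id}_{w}>" if w in fillers else w
            for w in words]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]          # one frame of hidden features
    style = [2.0, 0.0]                # toy prosody/style embedding
    w_scale = [[1.0, 0.0]] * 3        # scale depends on style[0] only
    w_shift = [[0.0, 0.0]] * 3        # shift ignores the style here
    zeros = [0.0, 0.0, 0.0]
    print(conditional_layer_norm(hidden, style, w_scale, zeros, w_shift, zeros))
    print(tag_fillers(["so", "uh", "anyway"], "c1", {"uh"}))
```

Because the scale and shift come from the style vector rather than from fixed parameters, the same hidden features are modulated differently for each speaker, which is one plausible reason this kind of conditioning suits a setting with only a few minutes of data per comedian.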

Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

Adaptive Text to Speech for Spontaneous Style

Comic-guided speech synthesis

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Generative Expressive Conversational Speech Synthesis

SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Ensemble prosody prediction for expressive speech synthesis

Prosody-controllable spontaneous TTS with neural HMMs

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading