ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Yuyue Wang, Huan Xiao, Yihan Wu, Ruihua Song
2023-05-20
Abstract: Text-to-speech (TTS) models can generate natural, high-quality speech, but they are not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedy. Because comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, the task requires real-world datasets and strong speech-style modeling capabilities, both of which are challenging. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios. First, we extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way. Second, we enhance personal rhythm modeling with a conditional duration predictor. Third, we model personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data for each comedian. The audio samples are available at https://xh621.github.io/stand-up-comedy-demo/
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of synthesizing stand-up comedy speech with a strong personal style in low-resource scenarios. Existing text-to-speech (TTS) models can generate natural, high-quality speech, but they perform poorly on speech that demands high expressiveness, such as stand-up comedy. Stand-up comedians usually have distinctive personal speech styles, including individual prosody, rhythm, and filler words, and these place higher demands on TTS systems.

The main contributions of the paper are:

1. **Constructing a new dataset**: Because real-world stand-up comedy speech datasets are lacking, the authors constructed a multi-speaker stand-up comedy speech dataset.
2. **Developing the ComedicSpeech system**: The authors propose a TTS system designed specifically for stand-up comedy speech synthesis, which improves the modeling of personal speech style in three ways:
   - **Personal prosody modeling**: A prosody encoder extracts a personal prosody representation, which is integrated into the TTS model through conditional layer normalization.
   - **Personal rhythm modeling**: A conditional duration predictor learns the speaker-dependent phoneme duration distribution, improving the modeling of personal rhythm.
   - **Personal filler word modeling**: Special tokens are introduced to represent personal filler words, capturing the speaking habits of individual comedians.

Experimental results show that ComedicSpeech, using only 10 minutes of training data per speaker, outperforms baseline models in speech quality, speech similarity, and expressiveness.
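
The conditional layer normalization mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tiny dimensions, and the use of plain linear maps to produce the scale and shift are all assumptions made for illustration. The idea it shows is the general one: the hidden vector is normalized as usual, but the scale and shift are predicted from a personal-style embedding instead of being fixed learned parameters.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output per weight row, computed as row . x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conditional_layer_norm(
    hidden: Vector,
    style: Vector,
    scale_w: Matrix,
    scale_b: Vector,
    shift_w: Matrix,
    shift_b: Vector,
    eps: float = 1e-5,
) -> Vector:
    """Layer-normalize `hidden`, then apply a scale and shift predicted from `style`.

    All parameter names here are hypothetical; in a real model the four
    weight/bias arguments would be learned.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(scale_w, scale_b, style)  # style-dependent gain
    shift = linear(shift_w, shift_b, style)  # style-dependent offset
    return [g * n + b for g, n, b in zip(scale, normed, shift)]


# Toy example: a 4-dimensional hidden vector and a 2-dimensional style embedding.
hidden = [1.0, 2.0, 3.0, 4.0]
zeros4 = [0.0] * 4
ones4 = [1.0] * 4

# With zero projection weights the style is ignored: plain layer normalization.
zero_w = [[0.0, 0.0] for _ in hidden]
plain = conditional_layer_norm(hidden, [0.3, -0.7], zero_w, ones4, zero_w, zeros4)

# Here the first style component sets the scale and the second sets the shift.
scale_w = [[1.0, 0.0] for _ in hidden]
shift_w = [[0.0, 1.0] for _ in hidden]
styled = conditional_layer_norm(hidden, [2.0, 0.5], scale_w, zeros4, shift_w, zeros4)
```

In this toy setup, `styled` equals `plain` scaled by 2 and shifted by 0.5, so two different style embeddings produce different outputs from the same hidden vector. That is the sense in which the style information conditions the layer.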