Abstract: In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging recent advances in text-prompt-based generation, it should be possible to control voice characteristics in a more nuanced way. While previous research has explored prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limits the diversity of voice characteristics available. We aim to address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, covering a broader range of voice characteristics. Specifically, we propose a method for building a sizable corpus that pairs voice characteristics descriptions with corresponding speech samples. The method involves automatically gathering voice-related speech data from the Internet, ensuring its quality, and manually annotating it through crowdsourcing. We apply this method to Japanese-language data and analyze the results to validate its effectiveness. We then propose a method for constructing a model that retrieves speech from voice characteristics descriptions, based on contrastive learning. We train the model using not only conservative contrastive learning but also feature prediction learning, in which the model predicts quantitative speech features corresponding to voice characteristics. We evaluate the model's performance in experiments on the corpus we constructed.
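To make the training objective described above more concrete, the following is a minimal sketch of how a contrastive loss over paired speech and description embeddings might be combined with an auxiliary feature prediction loss. The abstract does not specify the exact formulation, so everything here is an assumption for illustration: the symmetric softmax (InfoNCE-style) contrastive term, the mean-squared-error feature term, the `temperature` and `weight` values, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; item i of each list forms a true pair.

    The temperature value is an assumed default, not taken from the paper.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every speech and description embedding.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # description-to-speech
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t)
    )


def feature_prediction_loss(predicted, target):
    """Mean squared error between predicted and measured speech features."""
    n = sum(len(row) for row in target)
    squared = sum(
        (p - q) ** 2
        for p_row, q_row in zip(predicted, target)
        for p, q in zip(p_row, q_row)
    )
    return squared / n


def total_loss(speech_emb, text_emb, predicted, target, weight=1.0):
    """Contrastive term plus a weighted feature prediction term (assumed form)."""
    return contrastive_loss(speech_emb, text_emb) + weight * feature_prediction_loss(
        predicted, target
    )
```

In this sketch, correctly paired embeddings yield a lower contrastive loss than mismatched ones, and the feature term penalizes errors in predicted quantitative features. Which features are predicted and how the two terms are weighted would depend on the actual model.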
