WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

2024-06-19

Abstract: With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-source WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.

Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the lack of large-scale, high-quality open datasets for Chinese text-to-speech (TTS). Large English TTS models are trained on very large corpora (VALL-E uses about 60,000 hours of English speech and NaturalSpeech 2 about 44,000 hours), whereas open Chinese TTS datasets are smaller and less diverse. For example, DiDiSpeech, the largest open-source Chinese speech dataset, contains only about 800 hours of read-style speech, which is far from sufficient for training large TTS models. To close this gap, the authors start from WenetSpeech, an existing large-scale automatic speech recognition (ASR) dataset, and derive a new large-scale Chinese TTS dataset, WenetSpeech4TTS, through a series of processing steps: merging adjacent segments, extending segment boundaries, enhancing audio quality, detecting multi-speaker segments, re-transcribing with speech recognition, and filtering by quality. The resulting corpus contains 12,800 hours of paired audio-text data and is divided into subsets by quality score to support training and fine-tuning of TTS models at different scales. The authors also train and fine-tune VALL-E and NaturalSpeech 2 on these subsets, which validates the usability of WenetSpeech4TTS and establishes baseline results on the benchmark.
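The quality-based split into subsets can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Segment` fields, the subset names, and the threshold values are hypothetical and are not taken from the paper, and the subsets are assumed to be nested (a stricter subset is contained in every looser one), which the summary above does not state.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: str
    duration_s: float  # segment length in seconds
    score: float       # hypothetical per-segment quality score


def build_subsets(segments, thresholds):
    """Map each subset name to the segments whose score meets its threshold.

    Nested by construction: a segment that passes a strict threshold
    also passes every looser one.
    """
    return {
        name: [s for s in segments if s.score >= t]
        for name, t in thresholds.items()
    }


def total_hours(segments):
    """Total duration of a list of segments, in hours."""
    return sum(s.duration_s for s in segments) / 3600.0


# Placeholder thresholds for illustration only, not the paper's values.
THRESHOLDS = {"strict": 4.0, "medium": 3.5, "loose": 3.0}

segments = [
    Segment("a", 3600.0, 4.2),
    Segment("b", 1800.0, 3.7),
    Segment("c", 1800.0, 3.1),
    Segment("d", 3600.0, 2.5),  # below every threshold, so filtered out
]

subsets = build_subsets(segments, THRESHOLDS)
for name, segs in subsets.items():
    print(name, len(segs), "segments,", total_hours(segs), "hours")
```

Under these assumptions, a small high-quality subset would suit fine-tuning, while the larger, looser subsets would suit large-scale training.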

The WISTON Text to Speech System for Blizzard 2008

A Miniature Chinese TTS System Based on Tailored Corpus

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Text-To-Speech Synthesis In The Wild

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

VoiceBank-2023: A Multi-Speaker Mandarin Speech Corpus for Constructing Personalized TTS Systems for the Speech Impaired

MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset

MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

The NTU-AISG Text-to-speech System for Blizzard Challenge 2020

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Design of Speech Corpus for Mandarin Text to Speech

KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects

NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design