WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie
2024-06-19
Abstract: With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the lack of large-scale, high-quality datasets for Chinese text-to-speech (TTS). Large English corpora already exist: VALL-E was trained on 60,000 hours of English speech and NaturalSpeech 2 on 44,000 hours. Chinese TTS datasets, by comparison, are smaller and less diverse. For example, the largest open-source Chinese speech dataset, DIDISPEECH, contains only about 800 hours of read-style speech, which is far from sufficient for training large-scale TTS models. To close this gap, the research team started from WenetSpeech, an existing large-scale automatic speech recognition (ASR) dataset, and built a new large-scale Chinese TTS dataset, WenetSpeech4TTS, through a series of processing steps: merging adjacent segments, extending segment boundaries, enhancing audio quality, detecting multi-speaker segments, re-transcribing the audio with speech recognition, and filtering by quality. The resulting dataset contains 12,800 hours of paired audio-text data and is divided into subsets by quality score to support the training and fine-tuning of TTS models at various scales. The team also trained and fine-tuned VALL-E and NaturalSpeech 2 systems on these subsets, which validates the usability of WenetSpeech4TTS and establishes baseline results for the benchmark.
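The split into quality-based subsets described above can be illustrated with a minimal Python sketch. The summary does not state the scoring metric, the thresholds, or the subset names, so everything here is an assumption for illustration: the `Segment` fields, the tier names (`high`, `medium`, `low`, `rest`), and the threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    duration_s: float  # segment length in seconds
    quality: float     # hypothetical quality score; higher is better


# Illustrative thresholds only (assumption, not the paper's values),
# listed from the strictest tier to the most lenient.
TIERS = [("high", 4.0), ("medium", 3.8), ("low", 3.6)]


def assign_tier(score, tiers=TIERS):
    """Return the first tier whose threshold the score strictly exceeds."""
    for name, threshold in tiers:
        if score > threshold:
            return name
    return "rest"  # everything below the lowest threshold


def split_by_quality(segments):
    """Group segments into disjoint subsets keyed by tier name."""
    subsets = {name: [] for name, _ in TIERS}
    subsets["rest"] = []
    for seg in segments:
        subsets[assign_tier(seg.quality)].append(seg)
    return subsets


def hours(segments):
    """Total duration of a list of segments, in hours."""
    return sum(s.duration_s for s in segments) / 3600.0
```

In this sketch the tiers are disjoint, so a larger training set would be formed by concatenating tiers, while the strictest tier alone could serve as a fine-tuning set. Whether the actual corpus uses disjoint or nested subsets is not stated in this summary.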