Abstract: This paper introduces KeSpeech, an open-source speech dataset comprising 1,542 hours of speech recorded by 27,237 speakers in 34 cities in China, with pronunciation covering standard Mandarin and its eight subdialects. The new dataset has several notable properties. First, it provides multiple labels, including content transcription, speaker identity, and subdialect, and therefore supports a variety of speech processing tasks, such as speech recognition, speaker recognition, and subdialect identification, as well as more advanced techniques like multi-task learning and conditional learning. Second, some of the text samples were recorded in parallel in both standard Mandarin and a particular subdialect, allowing for new applications such as subdialect style conversion. Third, the number of speakers is much larger than in other open-source datasets, making it suitable for tasks that require training data from a large number of speakers. Finally, the speech signals were recorded in two phases, which opens the opportunity to study the time-variance property of human speech. We present the design principles of the KeSpeech dataset and four baseline systems built on it: speech recognition, speaker verification, subdialect identification, and voice conversion. The dataset is free for all academic use.

Speech Wikimedia: A 77 Language Multilingual Speech Dataset

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

MLS: A Large-Scale Multilingual Dataset for Speech Research

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

MediaSpeech: Multilanguage ASR Benchmark and Dataset

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

Common Voice: A Massively-Multilingual Speech Corpus

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

A large-scale multimodal dataset of human speech recognition

Hi-Fi Multi-Speaker English TTS Dataset

A Recorded Debating Dataset

LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Speech Resources in the Tamasheq Language

Scaling Speech Technology to 1,000+ Languages

KeSpeech: an Open Source Speech Dataset of Mandarin and Its Eight Subdialects

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage