Transcribe, Align and Segment: Creating speech datasets for low-resource languages

Taras Sereda
2024-06-18
Abstract: In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align the generated transcripts with the audio and segment the result into short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release UK-PODS, a new dataset of modern conversational Ukrainian that contains over 50 hours of text-audio pairs. We also release uk-pods-conformer, a 121M-parameter ASR model trained on MCV-10 and UK-PODS; it achieves a 3x reduction in Word Error Rate (WER) on podcasts compared to the publicly available uk-nvidia-citrinet while maintaining comparable WER on the MCV-10 test split. Both the UK-PODS dataset (<a class="link-external link-https" href="https://huggingface.co/datasets/taras-sereda/uk-pods" rel="external noopener nofollow">https://huggingface.co/datasets/taras-sereda/uk-pods</a>) and the uk-pods-conformer model (<a class="link-external link-https" href="https://huggingface.co/taras-sereda/uk-pods-conformer" rel="external noopener nofollow">https://huggingface.co/taras-sereda/uk-pods-conformer</a>) are available on the Hugging Face Hub.
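The segmentation step described in the abstract can be illustrated with a minimal sketch. It assumes the alignment step has already produced word-level timestamps, and it splits on long pauses or when an utterance would exceed a maximum duration. The `Word` and `Utterance` types, the `segment` function, and the thresholds (15 s maximum length, 0.5 s pause) are hypothetical choices for illustration, not the author's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # start time in seconds (assumed output of the alignment step)
    end: float    # end time in seconds


@dataclass
class Utterance:
    text: str
    start: float
    end: float


def _flush(words: List[Word]) -> Utterance:
    # Join the buffered words into one utterance spanning first start to last end.
    return Utterance(" ".join(w.text for w in words), words[0].start, words[-1].end)


def segment(words: List[Word],
            max_duration: float = 15.0,  # hypothetical upper bound per utterance
            min_pause: float = 0.5       # hypothetical silence threshold
            ) -> List[Utterance]:
    """Group time-aligned words into short utterances.

    A new utterance starts when the silence before a word is at least
    `min_pause` seconds, or when adding the word would make the current
    utterance longer than `max_duration` seconds.
    """
    utterances: List[Utterance] = []
    current: List[Word] = []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            too_long = word.end - current[0].start > max_duration
            if pause >= min_pause or too_long:
                utterances.append(_flush(current))
                current = []
        current.append(word)
    if current:
        utterances.append(_flush(current))
    return utterances


# Toy input: four aligned words with a 0.8 s pause after the second one.
words = [
    Word("добрий", 0.0, 0.4),
    Word("день", 0.45, 0.8),
    Word("сьогодні", 1.6, 2.1),
    Word("говоримо", 2.15, 2.7),
]
for utt in segment(words):
    print(f"{utt.start:.2f}-{utt.end:.2f}: {utt.text}")
```

Each resulting utterance carries the start and end times needed to cut the matching audio clip, which yields the text-audio pairs a training set requires. A real pipeline would likely also filter out segments with low alignment confidence.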
Audio and Speech Processing