Abstract: Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four significant low-resource languages of the region. During data collection, thirty-two distinct adult speakers were recorded for Lingala Read Speech, each in a unique context, under various settings and with different accents. Concurrently, the Congolese Speech Radio raw data were taken from the archive of a broadcast station and then passed through a purpose-designed curation process. During data preparation, several strategies were used to pre-process the data. The datasets, which have been made freely accessible to all researchers, serve as a valuable resource not only for investigating and developing monolingual methods and approaches that employ linguistically distant languages, but also for multilingual approaches with linguistically similar languages. Using supervised and self-supervised learning, we establish the first speech recognition benchmark for Lingala and present the first multilingual model tailored for four Congolese languages spoken by an aggregated population of 95 million. Two models were applied to these datasets: the first is trained with supervised learning and the second is used for self-supervised pre-training.

Transcribe, Align and Segment: Creating speech datasets for low-resource languages

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

CrowdSpeech and VoxDIY: Benchmark Datasets for Crowdsourced Audio Transcription

Phonetic Segmentation of the UCLA Phonetics Lab Archive

Speech recognition datasets for low-resource Congolese languages

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Universal Cross-Lingual Data Generation for Low Resource ASR

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Textless NLP -- Zero Resource Challenge with Low Resource Compute

MediaSpeech: Multilanguage ASR Benchmark and Dataset