Abstract: Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (<a class="link-external link-https" href="https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard" rel="external noopener nofollow">https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard</a>) along with curated datasets (<a class="link-external link-https" href="https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2" rel="external noopener nofollow">https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2</a>, <a class="link-external link-https" href="https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos" rel="external noopener nofollow">https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos</a>) and the open challenge call (<a class="link-external link-https" href="https://poleval.pl/tasks/task3" rel="external noopener nofollow">https://poleval.pl/tasks/task3</a>). Tools used for evaluation are open-sourced (<a class="link-external link-https" href="https://github.com/goodmike31/pl-asr-bigos-tools" rel="external noopener nofollow">https://github.com/goodmike31/pl-asr-bigos-tools</a>), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.

MediaSpeech: Multilanguage ASR Benchmark and Dataset

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

ASR Benchmarking: Need for a More Representative Conversational Dataset

Towards measuring fairness in speech recognition: Fair-Speech dataset

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Svarah: Evaluating English ASR Systems on Indian Accents

Anatomy of Industrial Scale Multilingual ASR

WER We Stand: Benchmarking Urdu ASR Models

SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation

Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

Speech Wikimedia: A 77 Language Multilingual Speech Dataset

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages