Abstract: Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (<a class="link-external link-https" href="https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard" rel="external noopener nofollow">https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard</a>) along with curated datasets (<a class="link-external link-https" href="https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2" rel="external noopener nofollow">https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2</a>, <a class="link-external link-https" href="https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos" rel="external noopener nofollow">https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos</a>) and the open challenge call (<a class="link-external link-https" href="https://poleval.pl/tasks/task3" rel="external noopener nofollow">https://poleval.pl/tasks/task3</a>). Tools used for evaluation are open-sourced (<a class="link-external link-https" href="https://github.com/goodmike31/pl-asr-bigos-tools" rel="external noopener nofollow">https://github.com/goodmike31/pl-asr-bigos-tools</a>), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.

ASR Bundestag: A Large-Scale political debate dataset in German

Political corpus creation through automatic speech recognition on EU debates

SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments

Open Source Automatic Speech Recognition for German

A Recorded Debating Dataset

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

MediaSpeech: Multilanguage ASR Benchmark and Dataset

LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

A Large-scale Dataset for Audio-Language Representation Learning

A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis

Anatomy of Industrial Scale Multilingual ASR

wav2vec and its current potential to Automatic Speech Recognition in German for the usage in Digital History: A comparative assessment of available ASR-technologies for the use in cultural heritage contexts

ASR Benchmarking: Need for a More Representative Conversational Dataset

Using Kaldi for Automatic Speech Recognition of Conversational Austrian German

A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning

Audio Dialogues: Dialogues dataset for audio and music understanding

Speaker attribution in German parliamentary debates with QLoRA-adapted large language models