Strategies for managing time and costs in speech corpus creation: insights from the Slovenian ARTUR corpus

Darinka Verdonik, Andreja Bizjak, Andrej Žgank, Mirjam Sepesy Maučec, Mitja Trojar, Jerneja Žganec Gros, Marko Bajec, Iztok Lebar Bajec, Simon Dobrišek
DOI: https://doi.org/10.1007/s10579-024-09792-2
2024-12-01
Language Resources and Evaluation
Abstract: The paper details the creation of an open-access speech corpus for a less-resourced language, covering the diversity of accents, dialects, speech styles and demographic characteristics in the target population. Three primary challenges are identified that significantly affect the time and cost efficiency of developing such a corpus: (1) managing copyrights, personality rights and personal data protection; (2) obtaining precise word-by-word transcriptions with segmentation into meaningful segments and annotation of speaker exchanges; (3) managing the new collaborators needed for field recording and for manual transcription or correction of transcriptions, along with a strictly defined workflow and platform for data collection. Several strategies are proposed to address each of these challenges, and the experience gained in creating the Slovenian ARTUR corpus is described. The ARTUR corpus comprises 1,000 h of carefully selected recordings, of which 880 h are accompanied by precise, manually checked or manually created transcriptions. It is freely available in the CLARIN.SI repository under the CC BY-SA 4.0 licence. Part of its data was used to upgrade the Slovenian reference speech corpus GOS.
computer science, interdisciplinary applications