Abstract: At a time when the quantity of more or less freely available data is increasing significantly, thanks to digital corpora, editions or libraries, the development of data mining tools or deep learning methods allows researchers to build a corpus of study tailored to their research, to enrich their data and to exploit them. Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. The alternation of training and correction phases makes it possible to improve the quality of the results by rapidly accumulating raw text data. These can then be structured, for example in XML/TEI, and enriched. The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, known to linguists and functional for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of sufficiently large lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of ancient languages will be presented, as well as experiments for the lemmatization of Medieval and Premodern Occitan. These techniques open the way to many forms of exploitation.
The much-desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, provided everyone takes the trouble to make their data freely available online and reusable. By presenting different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical issues on which such practices are based.

Strategies for managing time and costs in speech corpus creation: insights from the Slovenian ARTUR corpus

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Polish Read Speech Corpus for Speech Tools and Services

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer

Slovenian parliamentary corpus siParl

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route

Phonetic Segmentation of the UCLA Phonetics Lab Archive

Political corpus creation through automatic speech recognition on EU debates

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Praaline: Integrating Tools for Speech Corpus Research

Transcribe, Align and Segment: Creating speech datasets for low-resource languages

A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

Spoken Language Translation for Polish

CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

Šolar, the developmental corpus of Slovene

A Survey of Resources and Methods for Natural Language Processing of Serbian Language

MediaSpeech: Multilanguage ASR Benchmark and Dataset

Producing Corpora of Medieval and Premodern Occitan