Abstract: At a time when the quantity of (more or less freely) available data is increasing significantly, thanks to digital corpora, editions or libraries, the development of data mining tools or deep learning methods allows researchers to build a corpus of study tailored to their research, to enrich their data and to exploit them. Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. Alternating training and correction phases makes it possible to improve the quality of the results by rapidly accumulating raw text data. These data can then be structured, for example in XML/TEI, and enriched. The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, familiar to linguists and functional for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of sufficiently large lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of ancient languages will be presented, as well as experiments for the lemmatization of Medieval and Premodern Occitan. These techniques open the way for many uses.
The much-desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, provided everyone takes the trouble to make their data freely available online and reusable. By presenting different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical issues on which such practices are based.

Etiqueter un corpus oral par apprentissage automatique à l'aide de connaissances linguistiques

Temperature Effects on Mechanical Properties of Zinc Dithiophosphate Tribofilms

Standardizing linguistic data: method and tools for annotating (pre-orthographic) French

Applying Cooperative Machine Learning to Speed Up the Annotation of Social Signals in Large Multi-modal Corpora

Multilabel classification of medical concepts for patient clinical profile identification

Un système modulaire d'acquisition automatique de traductions à partir du Web

DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

An open-source voice type classifier for child-centered daylong recordings

ASDA : Analyseur Syntaxique du Dialecte Algérien dans un but d'analyse sémantique

Objets Sonores: Une Représentation Bio-Inspirée Hiérarchique Parcimonieuse À Très Grandes Dimensions Utilisable En Reconnaissance; Auditory Objects: Bio-Inspired Hierarchical Sparse High Dimensional Representation for Recognition

Corpus and Models for Lemmatisation and POS-tagging of Old French

Language-Agnostic Syllabification with Neural Sequence Labeling

Predicting CEFRL levels in learner English on the basis of metrics and full texts

Unsupervised ASR via Cross-Lingual Pseudo-Labeling

New Semantic Task for the French Spoken Language Understanding MEDIA Benchmark

Automated Utterance Labeling of Conversations Using Natural Language Processing

MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning

Label distribution learning for compound facial expression recognition in‐the‐wild: A comparative study

Producing Corpora of Medieval and Premodern Occitan

Establishing a New State-of-the-Art for French Named Entity Recognition

Efficient Spoken Language Recognition via Multilabel Classification