Historical Portuguese corpora: a survey

Tomás Freitas Osório, Henrique Lopes Cardoso
DOI: https://doi.org/10.1007/s10579-024-09757-5
2024-07-20
Language Resources and Evaluation
Abstract: This survey aims to thoroughly examine and evaluate the current landscape of electronic corpora in historical Portuguese. This is achieved through a comprehensive analysis of existing resources. The article makes two main contributions. The first is an exhaustive cataloguing of existing Portuguese historical corpora, where each corpus is meticulously detailed regarding linguistic periods, geographic origins, and thematic contents. The second contribution focuses on the digital accessibility of these corpora for researchers. These contributions are crucial in enhancing and progressing the study of historical corpora in the Portuguese language, laying a critical groundwork for future linguistic research in this field. Our survey identified 20 freely accessible corpora, comprising approximately 63.9 million tokens, and two private corpora, totalling 59.9 million tokens.
computer science, interdisciplinary applications