Abstract: To date, research in various educational and linguistic domains, such as learner corpus research, writing research, and second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions, combined with domain-inherent obstacles to data sharing, has so far hampered the comparability, reusability and reproducibility of data and research results. In this article, we present work on creating a digital infrastructure for L1 and L2 learner corpora and populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing on technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, and we discuss how far research data collected in the past can be made comparable, reusable and reproducible. Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions, which can be exploited for domain-specific infrastructures such as the one presented in this article. Other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement is needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management is needed from the beginning of new corpus creation projects to ensure that all requirements for FAIR data can be met.

Leipzig Corpus Miner - A Text Mining Infrastructure for Qualitative Data Analysis

iLCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data

A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events

A survey of methods to ease the development of highly multilingual text mining applications

CorpusVis: Visual Analysis of Digital Sheet Music Collections

An automated domain-independent text reading, interpreting and extracting approach for reviewing the scientific literature

Mining Asymmetric Intertextuality

A Corpus for Automatic Readability Assessment and Text Simplification of German

Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts

QuaLLM: An LLM-based Framework to Extract Quantitative Insights from Online Forums

Quantitative approaches to content analysis: identifying conceptual drift across publication outlets

Providing Digital Infrastructure for Audio-Visual Linguistic Research Data with Diverse Usage Scenarios: Lessons Learnt

Random matrix ensembles of time-lagged correlation matrices: Derivation of eigenvalue spectra and analysis of financial time-series

MinerU: An Open-Source Solution for Precise Document Content Extraction

Analyzing social media data: A mixed-methods framework combining computational and qualitative text analysis

How We Do Things With Words: Analyzing Text as Social and Cultural Data

Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora

Text data mining and data quality management for research information systems in the context of open data and open science

Legal aspects of text mining

Author Name Co-Mention Analysis: Testing a Poor Man's Author Co-Citation Analysis Method

In Search of Meaning: Lessons, Resources and Next Steps for Computational Analysis of Financial Discourse