Optimizing a Data Science System for Text Reuse Analysis

Ananth Mahadevan, Michael Mathioudakis, Eetu Mäkelä, Mikko Tolonen
2024-01-14
Abstract: Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that re-appear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns, impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks.
Databases