Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia

Natallia Kokash, Giovanni Colavizza
2024-06-28
Abstract: Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
Digital Libraries
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

1. **Development of a reusable pipeline**: To extract and process citation data from multilingual versions of Wikipedia, the authors developed an open-source software pipeline. This pipeline can handle any given Wikipedia data dump, run in a cloud environment, and convert citation templates from different language versions into a unified English format.
2. **Improving the quality and reliability of Wikipedia citation data**: The paper aims to enhance the quality and reliability of citation data through automated processes for classification and identifier lookup. This includes identifying which citations point to reliable sources such as journal articles, books, or news reports.
3. **Integration with open science infrastructure**: Through this project, the authors hope to promote the integration between Wikipedia and open science infrastructure, making Wikipedia a more reliable source of information and enhancing its role as a tool for scientific research.
4. **Providing comprehensive datasets**: The paper reports the extraction of a large amount of citation data from the English Wikipedia, with the datasets growing across successive dumps. Additionally, it covers various other language versions of Wikipedia, making cross-language comparisons possible.

In summary, the goal of this paper is to improve and standardize the process of extracting citation information from Wikipedia to support broader academic research and societal applications.
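
To make the template translation idea concrete, here is a minimal Python sketch of mapping a localized citation template onto a generic English one. It is not the authors' implementation: the function names, the mapping tables, and the German and French template and parameter names used as entries are illustrative assumptions, and the parser only handles flat templates without nested templates or links.

```python
# Hypothetical mapping tables (illustrative only, not taken from the paper).
TEMPLATE_NAME_MAP = {
    "literatur": "cite book",  # assumed German template name
    "ouvrage": "cite book",    # assumed French template name
}
PARAM_NAME_MAP = {
    "autor": "author",
    "titel": "title",
    "jahr": "year",
    "auteur": "author",
    "titre": "title",
    "année": "year",
}


def parse_template(wikitext):
    """Split a flat '{{name |key=value |...}}' template into (name, params).

    Nested templates and wiki links containing '|' are not supported.
    """
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template: %r" % wikitext)
    parts = [part.strip() for part in inner[2:-2].split("|")]
    name = parts[0]
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return name, params


def translate_template(wikitext):
    """Map a localized template onto generic English names.

    Unknown template or parameter names are passed through unchanged.
    """
    name, params = parse_template(wikitext)
    generic_name = TEMPLATE_NAME_MAP.get(name.lower(), name)
    generic_params = {
        PARAM_NAME_MAP.get(key.lower(), key): value
        for key, value in params.items()
    }
    return generic_name, generic_params


if __name__ == "__main__":
    sample = "{{Literatur |Autor=Max Mustermann |Titel=Ein Buch |Jahr=2020}}"
    print(translate_template(sample))
```

Passing unknown names through unchanged keeps unmapped citations visible for later inspection instead of silently dropping them; a production pipeline would more likely rely on a full wikitext parser than on string splitting.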