Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Jill P. Naiman, Morgan G. Cosillo, Peter K. G. Williams, Alyssa Goodman

2023-09-21

Abstract: Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. We provide baseline models trained with this dataset and find the mean improvement in character and word error rates of 7.71% and 18.82% for historical OCR text, respectively. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.

Digital Libraries, Instrumentation and Methods for Astrophysics

What problem does this paper attempt to address?

This paper addresses the errors that optical character recognition (OCR) introduces into the text of historical scientific literature. The authors focus on the astrophysics holdings of the NASA Astrophysics Data System (ADS), where articles published before roughly 1997 must be OCR'd from scans, and the resulting errors hinder both reading and downstream natural language processing. To train correction models, they generate a large synthetic dataset by mining article source files (mainly LaTeX/TeX) from the arXiv. The dataset contains 203,354,393 character pairs and is, to the authors' knowledge, the largest synthetic ground truth/OCR post correction dataset in the scientific domain. The workflow consists of obtaining source files from the arXiv, building a segmentation model for article paragraphs, annotating the LaTeX documents to mark synthetic ground truth (SGT) words, and then running the Tesseract OCR engine on the resulting pages. Aligning the OCR output with the SGT yields the paired data used to train post correction models. Baseline byt5-based models trained on this data improve character and word error rates on historical OCR text by a mean of 7.71% and 18.82%, respectively, and the baselines reach an F1 score of 77.82% when classifying parts of sentences as inline math. Although the work targets astronomical literature, the method can be extended to other scientific fields. The code is written in Python, data and code are hosted on GitHub within the limits of the authors' agreement with the arXiv, and interactive dashboards are available for exploring the dataset.
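To make the reported metrics concrete, the following is a minimal sketch of how character and word error rates can be computed from a ground truth string and an OCR string. It is not the authors' code: the function names (`edit_distance`, `cer`, `wer`) and the example strings are hypothetical, and it assumes the common definition of error rate as Levenshtein distance divided by the length of the reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate on whitespace tokens; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Hypothetical example strings, not taken from the dataset.
ground_truth = "the stellar mass function"
ocr_text = "tbe stel1ar mass functlon"    # raw OCR output
corrected = "the stellar mass functlon"   # output of a correction model

cer_before = cer(ground_truth, ocr_text)
cer_after = cer(ground_truth, corrected)
wer_before = wer(ground_truth, ocr_text)
wer_after = wer(ground_truth, corrected)

print(f"CER: {cer_before:.2f} -> {cer_after:.2f}")
print(f"WER: {wer_before:.2f} -> {wer_after:.2f}")
```

Comparing the rates before and after correction is one way to express an improvement; the exact definition of "improvement" used for the paper's 7.71% and 18.82% figures is not given in this summary, so the sketch stops at reporting the two rates side by side.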

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

OCR Post Correction for Endangered Language Texts

OCR4all -- An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

A Tool for Facilitating OCR Postediting in Historical Documents

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction

Scrambled text: training Language Models to correct OCR errors using synthetic data

OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach

EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

Toward a Period-specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Neural OCR Post-Hoc Correction of Historical Corpora

Post-OCR Text Correction for Bulgarian Historical Documents

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

Efficient OCR for Building a Diverse Digital History

Statistical Learning for OCR Text Correction

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs