Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Jill P. Naiman, Morgan G. Cosillo, Peter K. G. Williams, Alyssa Goodman
2023-09-21
Abstract: Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. We provide baseline models trained with this dataset and find the mean improvement in character and word error rates of 7.71% and 18.82% for historical OCR text, respectively. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
Digital Libraries, Instrumentation and Methods for Astrophysics
What problem does this paper attempt to address?
This paper addresses the errors that optical character recognition (OCR) introduces into the text of historical scientific literature. When text is extracted from the historical astrophysics holdings of the NASA Astrophysics Data System (ADS), imperfect OCR frequently produces errors that hinder both reading and downstream natural language processing. To address this, the authors generate a large synthetic dataset for training OCR post-correction models by mining source files (mainly LaTeX/TeX) from the arXiv. The dataset contains over 200 million character pairs (203,354,393) and is, to the authors' knowledge, the largest synthetic OCR post-correction dataset in the scientific domain. The workflow consists of obtaining source files from the arXiv, building a segmentation model for article paragraphs, annotating the LaTeX documents to create synthetic ground truth (SGT) words, and running the Tesseract OCR engine on the pages. Aligning the OCR output with the SGT gives the paired data used to train models that correct OCR errors. Baseline byt5-based models trained on this dataset improve character and word error rates on historical OCR text by a mean of 7.71% and 18.82%, respectively. Although the work focuses on the astrophysics literature, its methods can be extended to other scientific fields. To support future research, the code and data are hosted on GitHub, within the limits of the authors' agreement with the arXiv, and interactive dashboards are available for exploring the dataset.
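The character and word error rates mentioned above are usually computed as an edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The sketch below illustrates that common definition; it is not taken from the authors' code, and the function names (`edit_distance`, `character_error_rate`, `word_error_rate`) and the sample strings are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop an element of ref
                curr[j - 1] + 1,     # add an element of hyp
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_text):
    """Character-level edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def word_error_rate(ground_truth, ocr_text):
    """Word-level edit distance divided by the number of ground-truth words."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical ground truth/OCR pair, not drawn from the dataset.
    truth = "the mass of the galaxy"
    ocr = "tbe mass of tlie galaxy"
    print(f"CER = {character_error_rate(truth, ocr):.3f}")
    print(f"WER = {word_error_rate(truth, ocr):.3f}")
```

Under this definition a post-correction model is judged by how much it lowers both rates relative to the raw OCR text. The word error rate is typically the larger of the two, because a single wrong character makes the whole word count as an error.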