Abstract

Background
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results
Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, so that text is extracted from blocks grouped by section. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, which is commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
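
The three-stage process described in the Results can be illustrated with a minimal Python sketch. This is not the LA-PDFText implementation: the `Word` and `Block` types, the `max_gap` threshold, the `HEADING_RULES` table and all function names are hypothetical, and the block detector assumes a single-column page where blocks are separated only by vertical whitespace.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word and its bounding box on the page (hypothetical input type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Block:
    """A contiguous group of words plus the label assigned in stage 2."""
    words: list
    label: str = "unclassified"

    @property
    def text(self):
        return " ".join(w.text for w in self.words)


# Hypothetical rule table: heading text (lowercased) -> rhetorical category.
HEADING_RULES = {
    "abstract": "abstract",
    "methods": "methods",
    "results": "results",
    "conclusions": "conclusions",
}


def detect_blocks(words, max_gap=12.0):
    """Stage 1 (illustrative): start a new block at each large vertical gap."""
    blocks = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        if blocks and w.y0 - blocks[-1].words[-1].y1 <= max_gap:
            blocks[-1].words.append(w)
        else:
            blocks.append(Block([w]))
    return blocks


def classify_blocks(blocks):
    """Stage 2 (illustrative): headings set the category that later blocks inherit."""
    current = "front_matter"
    for b in blocks:
        key = b.text.strip().lower()
        if key in HEADING_RULES:
            current = HEADING_RULES[key]
            b.label = "heading"
        else:
            b.label = current
    return blocks


def stitch(blocks):
    """Stage 3 (illustrative): join body blocks in reading order, grouped by section."""
    sections = {}
    for b in blocks:
        if b.label != "heading":
            sections.setdefault(b.label, []).append(b.text)
    return {label: " ".join(parts) for label, parts in sections.items()}


words = [
    Word("Methods", 50, 100, 110, 110),
    Word("We", 50, 140, 65, 150),
    Word("parsed", 70, 140, 110, 150),
    Word("PDFs.", 50, 152, 90, 162),
    Word("Results", 50, 200, 100, 210),
    Word("It", 50, 240, 60, 250),
    Word("worked.", 65, 240, 110, 250),
]
sections = stitch(classify_blocks(detect_blocks(words)))
print(sections)
```

A real system would also need to handle multi-column layouts, font-based cues and blocks that span pages, none of which this sketch attempts.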

Layout-aware text extraction from full-text PDF of scientific articles
