Abstract: The layout of a portable document format (PDF) file is fixed regardless of the screen on which it is displayed, and its metadata are latent, in contrast to markup languages such as HTML and XML. Semantic tags are usually not provided, and a PDF file is not designed to be edited or to have its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the "chemistry-aware" natural-language-processing tool, ChemDataExtractor; however, this tool has limited capabilities for extracting data from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical named-entity-recognition capabilities of ChemDataExtractor, and it much improves the intrinsic PDF-reading abilities of ChemDataExtractor. The system features a template-based architecture, which enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct their logical structure. While other existing PDF-extraction tools focus on quantity mining, this template-based system focuses on quality mining across different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliations, emails, abstract, keywords, journal, year, digital object identifier (DOI), references, and issue number.
On a self-created set of evaluation articles, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.1c01198. It comprises the documents used for the evaluation of PDFDataExtractor and the corresponding scripts (ZIP).
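As a rough illustration of rule-based metadata extraction with JSON output, the sketch below pulls a DOI and email addresses out of plain text that has already been read from a PDF. It is not PDFDataExtractor's code or API: the function name `extract_metadata`, the two regular expressions, and the sample text (including the email address) are assumptions made for this example only.

```python
import json
import re

# Hypothetical patterns for illustration; not taken from PDFDataExtractor.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_metadata(text: str) -> dict:
    """Collect a DOI and email addresses from already-extracted page text."""
    doi = DOI_PATTERN.search(text)
    return {
        # Strip sentence punctuation that may trail the DOI in running text.
        "doi": doi.group(0).rstrip(".,;") if doi else None,
        "email": EMAIL_PATTERN.findall(text),
    }


# Invented sample text; only the DOI appears in the article page itself.
sample = (
    "Contact: jane.doe@example.org. "
    "https://doi.org/10.1021/acs.jcim.1c01198."
)
record = extract_metadata(sample)
print(json.dumps(record, indent=2))
```

A template-based system as described in the abstract would presumably go further and use layout cues specific to each publisher; fixed regular expressions like these only cover fields with a recognizable textual form.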

Reference Metadata Extraction from Scientific Papers

Metadata Extraction for Scientific Papers

Mining RDF from Tables in Chinese Encyclopedias

Rule Based Metadata Extraction Framework from Academic Articles

Metadata Extraction System for Chinese Books

New Methods for Metadata Extraction from Scientific Literature

Automatic Document Metadata Extraction Based on Deep Networks

Extracting method knowledge elements from scientific literature: A rule‐based approach

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Summarization of Automatic Metadata Extraction Research in China from 2001 to 2008

PDF articles metadata harvester

PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

Enhancing keyphrase extraction from academic articles with their reference information

BibRank: Automatic Keyphrase Extraction Platform Using Metadata

Keyphrases automatic extraction from the abstracts of English scientific papers based on Scopus retrieval

PKUSpace: A Collaborative Platform for Scientific Researching

Understanding the Semantics in Reference Linkages: an Ontological Approach for Scientific Digital Libraries

An Agent based Approach towards Metadata Extraction, Modelling and Information Retrieval over the Web

Extraction Knowledge Objects in Scientific Web Resource for Research Profiling

Discovering Patterns of Definitions and Methods from Scientific Documents