Abstract: Within the past few decades we have witnessed a digital revolution, which moved scholarly communication to electronic media and substantially increased its volume. Keeping track of the latest scientific achievements now poses a major challenge for researchers. Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across academia. Modern research infrastructures facilitate studying scientific literature by providing intelligent search tools, proposing similar and related documents, visualizing citation and author networks, assessing the quality and impact of articles, and so on. To provide such high-quality services, a system requires access not only to the text content of stored documents but also to their machine-readable metadata. Since good-quality metadata is not always available in practice, there is a strong demand for a reliable automatic method of extracting machine-readable metadata directly from source documents. This research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting a wide range of metadata directly from scientific articles in born-digital form. The extracted information includes basic document metadata, structured full text and the bibliography section. Designed as a universal solution, the proposed algorithm is able to handle a vast variety of publication layouts with high precision and is thus well suited for analyzing heterogeneous document collections. This was achieved by employing supervised and unsupervised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of the proposed metadata extraction algorithm. A comparison with other similar solutions also showed that our algorithm performs better than the competition for most metadata types.

A Rule-Based Framework of Metadata Extraction from Scientific Papers

Extracting method knowledge elements from scientific literature: A rule‐based approach

Rule Based Metadata Extraction Framework from Academic Articles

Metadata Extraction for Scientific Papers

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Automatic Document Metadata Extraction Based on Deep Networks

Metadata Extraction System for Chinese Books

New Methods for Metadata Extraction from Scientific Literature

Metadata Extracting for HTML Document Based on Rules

PKUSpace: A Collaborative Platform for Scientific Researching

Summarization of Automatic Metadata Extraction Research in China from 2001 to 2008

An Agent based Approach towards Metadata Extraction, Modelling and Information Retrieval over the Web

Object Recognition from Scientific Document based on Compartment Refinement Framework

Metadata for Scientific Experiment Reporting: A Case Study in Metal-Organic Frameworks

Extraction Knowledge Objects in Scientific Web Resource for Research Profiling

Effective Metadata Extraction from Irregularly Structured Web Content

Discovering Patterns of Definitions and Methods from Scientific Documents

An Automatic Keyphrase Extraction System for Scientific Documents

Understanding the Semantics in Reference Linkages: an Ontological Approach for Scientific Digital Libraries

A New Algorithm for the Acquisition of Knowledge from Scientific Literature in Specific Fields Based on Natural Language Comprehension

Extraction Rule Language for Web Information Extraction and Integration