Abstract: Within the past few decades we have witnessed a digital revolution, which moved scholarly communication to electronic media and also resulted in a substantial increase in its volume. Nowadays, keeping track of the latest scientific achievements poses a major challenge for researchers. Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across academia. Modern research infrastructures facilitate the study of scientific literature by providing intelligent search tools, proposing similar and related documents, visualizing citation and author networks, assessing the quality and impact of articles, and so on. To provide such high-quality services, a system requires access not only to the text content of stored documents, but also to their machine-readable metadata. Since in practice good-quality metadata is not always available, there is a strong demand for a reliable automatic method of extracting machine-readable metadata directly from source documents. This research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting a wide range of metadata directly from scientific articles in born-digital form. The extracted information includes basic document metadata, structured full text and the bibliography section. Designed as a universal solution, the proposed algorithm is able to handle a vast variety of publication layouts with high precision and is thus well suited to analyzing heterogeneous document collections. This was achieved by employing supervised and unsupervised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of the proposed metadata extraction algorithm. A comparison with other similar solutions also showed that our algorithm performs better than competing tools for most metadata types.
