Abstract: Within the past few decades we have witnessed a digital revolution, which moved scholarly communication to electronic media and substantially increased its volume. Keeping track of the latest scientific achievements now poses a major challenge for researchers. Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across academia. Modern research infrastructures facilitate studying scientific literature by providing intelligent search tools, proposing similar and related documents, visualizing citation and author networks, assessing the quality and impact of articles, and so on. To provide such high-quality services, a system requires access not only to the text content of stored documents but also to their machine-readable metadata. Since good-quality metadata is not always available in practice, there is a strong demand for a reliable automatic method of extracting machine-readable metadata directly from source documents. This research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting a wide range of metadata directly from scientific articles in born-digital form. The extracted information includes basic document metadata, structured full text and the bibliography section. Designed as a universal solution, the proposed algorithm is able to handle a vast variety of publication layouts with high precision and is thus well suited for analyzing heterogeneous document collections. This was achieved by employing supervised and unsupervised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of the proposed metadata extraction algorithm. A comparison with other similar solutions also showed that our algorithm performs better than the competition for most metadata types.

Metadata Extraction System for Chinese Books

Automatic Document Metadata Extraction Based on Deep Networks

Family-Oriented Personalized Digital Publishing System

Mining RDF from Tables in Chinese Encyclopedias

Citation Metadata Extraction Via Deep Neural Network-based Segment Sequence Labeling

An Approach to Auto-detection, Segmentation and Tagging of Bibliographic Metadata

Metadata Extraction for Scientific Papers

Cebbip: A Parser Of Bibliographic Information In Chinese Electronic Books

Structure extraction from PDF-based book documents

Digitizing On Chinese Ancient Books: Information Extraction And Retrieval

Summarization of Automatic Metadata Extraction Research in China from 2001 to 2008

Metadata designing and realizing in land use management information system

Searching online book documents and analyzing book citations

New Methods for Metadata Extraction from Scientific Literature

Review-Oriented Metadata Enrichment: A Case Study

Comprehensive Global Typography Extraction System for Electronic Book Documents

Automatic content based title extraction for Chinese documents using support vector machine

Rule Based Metadata Extraction Framework from Academic Articles

Realization of Automatic Cataloging of Chinese Monographs

Construction and Knowledge Mining of Traditional Chinese Medicine Ancient Books Bibliographic Abstracts Database Based on Genetic Algorithm and BP Neural Network