Abstract: The explosion of data over the past twenty years has fostered a large body of research on processing semi-structured documents such as HTML and XML documents on the Web. The growth of semi-structured documents that originate outside the Web domain, however, poses a greater challenge. Such documents are everywhere: in scientific research reports, official journals, electronic health records, and records from many other domains. Unlike HTML and XML, semi-structured records are usually human-readable and follow their own internal schemas. Because data volumes have grown so rapidly, traditional methods of studying these data cannot meet the need for high performance and flexible access. Relational database technology is a good alternative for managing and organizing such data. In this paper, we aim at (1) revealing the semantic structure in human-readable scientific records (HSR), and (2) presenting a transformation system that converts HSR into a structured relational database. Our approach identifies the relevant logical sub-structures and objects in HSR according to specific templates. We do not target "the best" methods but provide a usable framework that facilitates semi-structured HSR analysis tasks. The evaluation of the system shows that it transforms HSR efficiently, and it therefore contributes to handling the increasing corpus of semi-structured HSR documents.
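
As an illustration of the template-driven transformation described above, the following minimal sketch parses a labelled plain-text record and stores it in a relational table. It is not the system presented in the paper: the field labels (`Title`, `Author`, `Year`), the `TEMPLATE` dictionary, the `record` table, and the sample text are all hypothetical, and the sketch assumes each field occupies one line of the form `Label: value`.

```python
import re
import sqlite3

# Hypothetical template: one regular expression per field, assuming that
# every field sits on its own "Label: value" line in the record.
TEMPLATE = {
    "title": re.compile(r"^Title:[ \t]*(.+)$", re.MULTILINE),
    "author": re.compile(r"^Author:[ \t]*(.+)$", re.MULTILINE),
    "year": re.compile(r"^Year:[ \t]*(\d{4})[ \t]*$", re.MULTILINE),
}


def parse_record(text):
    """Apply each template rule to the text; unmatched fields become None."""
    row = {}
    for field, pattern in TEMPLATE.items():
        match = pattern.search(text)
        row[field] = match.group(1).strip() if match else None
    return row


def store_records(conn, records):
    """Parse every record and insert the result into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record "
        "(title TEXT, author TEXT, year INTEGER)"
    )
    rows = [parse_record(text) for text in records]
    conn.executemany(
        "INSERT INTO record (title, author, year) "
        "VALUES (:title, :author, :year)",
        rows,
    )
    conn.commit()
    return len(rows)


# Invented sample record, used only to exercise the sketch.
SAMPLE = "Title: Example report\nAuthor: A. Researcher\nYear: 2015\n"

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_records(connection, [SAMPLE])
    print(connection.execute("SELECT * FROM record").fetchall())
```

Keeping the template as data rather than code means that a different record layout would, under these assumptions, only require a new set of patterns and a matching table definition.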

Design of analysis system for documents based on web crawler

Design and Implementation of Crawler Program Based on Python

Design and Research of Web Crawler Based on Distributed Architecture

Design and Implementation of Engineering Standard Database System Based on Data Mining

Architectural Design and Evaluation of an Efficient Web-Crawling System

A Distributed Text Mining System for Online Web Textual Data Analysis

Application of Web Text Mining in Study Assistance

Design and Implementation of Crawler Based on Scrapy

Implementation of Web Data Mining Technology Based on Python

Web Crawler: Design And Implementation For Extracting Article-Like Contents

Study of Word-Based Chinese Document Experimental System and Chinese Free-Text Information Extraction Experiment Based on It

Method research and system design of automatic acquire recruitment information based on Internet

Web Content Extraction & Its Data Management Method

Summary of web crawler technology research

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Implementation of Recruitment Website Data Analysis System Based on Web Crawler

The Design and Implementation of an Internet Public Opinion Monitoring and Analyzing System

Design and Analysis of a Report Tracing System Based on Webinfomall

The Application of Web Crawler in City Image Research

Application and Design of Web Information Extraction System Based on Pattern Discovery

A Web Crawler-based Consensus Analysis System for Cross-Border Products