Abstract: The explosion of data over the past twenty years has fostered a large body of research on processing semi-structured documents such as HTML and XML documents on the Web. The growth of semi-structured documents that originate outside the Web domain, however, poses a greater challenge. Such documents are everywhere: in scientific research reports, official journals, electronic health records, and records from many other domains. Unlike HTML and XML, a semi-structured record is usually human-readable and has its own internal schema. Because data volumes have grown so rapidly, traditional methods of studying these data cannot meet the need for high-performance, flexible access. Relational database technology is a good alternative for managing and organizing such data. In this paper, we aim at (1) revealing the semantic structure of human-readable scientific records (HSR), and (2) presenting a transformation system that converts HSR into a structured relational database. Our approach identifies the relevant logical sub-structures and objects in HSR according to specific templates. We do not target "the best" methods but provide a usable framework that facilitates the analysis of semi-structured HSR. The evaluation of the system shows that it transforms HSR efficiently and thereby contributes to handling the growing corpus of semi-structured HSR documents.
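To make the idea of template-driven transformation concrete, the following is a minimal sketch, not the system described in the paper. It assumes a hypothetical template in which every field is a "Label: value" line and a blank line separates records; the sample text, the table name `record`, and the helper names `parse_records` and `load_into_database` are all illustrative assumptions.

```python
import re
import sqlite3

# Hypothetical template: each field is a "Label: value" line, and a blank
# line separates records. Real HSR templates will differ.
FIELD_PATTERN = re.compile(r"^(?P<label>[A-Za-z ]+):\s*(?P<value>.+)$")


def parse_records(text):
    """Split the text into records and map each field label to its value."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            match = FIELD_PATTERN.match(line.strip())
            if match:
                # Normalize "Sample ID" to a column-friendly "sample_id".
                key = match.group("label").strip().lower().replace(" ", "_")
                fields[key] = match.group("value").strip()
        if fields:
            records.append(fields)
    return records


def load_into_database(records, columns):
    """Create an in-memory table and insert one row per parsed record."""
    conn = sqlite3.connect(":memory:")
    # Column names come from a fixed, trusted list, never from the input text.
    column_defs = ", ".join(f"{name} TEXT" for name in columns)
    conn.execute(f"CREATE TABLE record ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    # Fields missing from a record are stored as NULL.
    rows = [tuple(rec.get(col) for col in columns) for rec in records]
    conn.executemany(f"INSERT INTO record VALUES ({placeholders})", rows)
    conn.commit()
    return conn


# Made-up sample input, for illustration only.
SAMPLE = """Sample ID: S-001
Species: Escherichia coli
Temperature: 37 C

Sample ID: S-002
Species: Bacillus subtilis
"""

records = parse_records(SAMPLE)
conn = load_into_database(records, ["sample_id", "species", "temperature"])
```

Once loaded, the records can be queried with ordinary SQL, for example `conn.execute("SELECT species FROM record WHERE temperature IS NOT NULL")`. This is the kind of flexible access the abstract contrasts with working on the raw text.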

Extraction and Integration of Intensive Web Information Based on XML

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Web mining of relations from XML and construct database schema

The Web information extraction technology research based on XML description

Attributes extraction of Deep Web query interface based on DOM

Web Information Segmentation Method Based on DOM Structure Tree

Extraction Rule Language for Web Information Extraction and Integration

Research on Automated Web Navigation and Data Integration Rules for Web Information Extraction

Tag Tree Template for Web Information and Schema Extraction

Study of Data Extraction Based on XML

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Application and Design of Web Information Extraction System Based on Pattern Discovery

XML Based Exchange and Integration Techniques of Heterogeneous Data with Implementation

Building Web Information Integration Systems

Feature Matrix Extraction and Classification of XML Pages

A web data integration technique based on XML

A XML-Based Approach for Information Search

Web Information Extraction Based on Repeated Pattern

Data Extraction from the Web Based on Pre-Defined Schema

Web data integration technology based on XML

Ontology-Based Two-Phase Semi-Automatic Web Extracting