Abstract: The explosion of data over the past twenty years has fostered a large body of research on processing semi-structured documents such as HTML and XML documents on the Web. The growth of semi-structured documents that originate outside the Web domain, however, is more challenging. Such documents are everywhere: in scientific research reports, official journals, electronic health records, and records from many other domains. Unlike HTML and XML, a semi-structured record is usually human-readable and has its own internal schema. Because data volumes have grown so rapidly, traditional methods of studying these data cannot meet the need for high performance and flexible access. Relational database technology is a good alternative for managing and organizing the data. In this paper, we aim at (1) revealing the semantic structure of human-readable scientific records (HSR), and (2) presenting a transformation system that converts HSR into a structured relational database. Our approach identifies the relevant logical sub-structures and objects in HSR according to specific templates. We do not target "the best" methods but provide a usable framework that facilitates semi-structured HSR analysis tasks. The evaluation of the system shows that it enables efficient transformation of HSR and therefore contributes to handling the growing corpus of semi-structured HSR documents.
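To make the template-driven idea concrete, the following is a minimal sketch of converting a human-readable record into relational rows. It is not the system described in the abstract. The record layout (`FIELD: value` lines with `//` as a record terminator), the `TEMPLATE` mapping, the table name `record`, and the helper names `parse_records` and `load_into_sqlite` are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical record layout: "FIELD: value" lines, records end with "//".
SAMPLE = """\
ID: R001
TITLE: Crystal structure of sample A
YEAR: 2011
//
ID: R002
TITLE: Thermal analysis of sample B
YEAR: 2013
//
"""

# Hypothetical template: field tag -> (column name, type converter).
TEMPLATE = {
    "ID": ("record_id", str),
    "TITLE": ("title", str),
    "YEAR": ("year", int),
}

LINE_RE = re.compile(r"^([A-Z]+):\s*(.*)$")


def parse_records(text, template):
    """Split text into records and map templated fields to column values."""
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":  # record terminator
            if current:
                records.append(current)
            current = {}
            continue
        match = LINE_RE.match(line)
        if match and match.group(1) in template:
            column, convert = template[match.group(1)]
            current[column] = convert(match.group(2).strip())
    if current:  # keep a trailing record that lacks a terminator
        records.append(current)
    return records


def load_into_sqlite(records):
    """Create an in-memory table and insert one row per parsed record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE record (record_id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO record (record_id, title, year) "
        "VALUES (:record_id, :title, :year)",
        records,
    )
    conn.commit()
    return conn


records = parse_records(SAMPLE, TEMPLATE)
conn = load_into_sqlite(records)
rows = conn.execute(
    "SELECT record_id, year FROM record WHERE year > 2012"
).fetchall()
```

Once the records are in a table, a question such as "which records are newer than 2012" becomes a single SQL query rather than a scan over text files, which is the kind of flexible access the abstract argues for.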
