Abstract: The explosion of data over the past twenty years has fostered a large body of research on processing semi-structured documents, such as HTML and XML documents on the Web. The growth of semi-structured documents that originate outside the Web domain, however, is more challenging. Semi-structured data are everywhere: in scientific research reports, official journals, electronic health records, and records from many other domains. Unlike HTML and XML, a semi-structured record is usually human-readable and has its own internal schema. As data volumes grow rapidly, traditional methods of studying these data cannot meet the need for high-performance, flexible access. Relational database technology is a good alternative for managing and organizing such data. In this paper, we aim to (1) reveal the semantic structure of human-readable scientific records (HSR), and (2) present a transformation system that converts HSR into a structured relational database. Our approach identifies the relevant logical sub-structures and objects in HSR according to specific templates. We do not target "the best" methods but provide a usable framework that facilitates semi-structured HSR analysis tasks. The evaluation of the system shows that it enables efficient transformation of HSR and therefore contributes to the growing corpus of semi-structured HSR documents.

Extracting Content for News Web Pages Based on DOM

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Extracting Web Content by Exploiting Multi-Category Characteristics

Analysis and Implementation of Extraction Algorithm of Web Hierarchy Structure

Web Information Segmentation Method Based on DOM Structure Tree

DOM-based Content Extraction of HTML Documents

The Technology of Extracting Content Information from Web Page Based on DOM Tree

DOM-Based Automatic Extraction of Topical Information from Web Pages

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Web Content Extraction & Its Data Management Method

Automatic Data Extraction from Web Discussion Forums

Solution for Automatic Web Review Extraction

Automatic Elements Extraction of Chinese Web News Using Prior Information of Content and Structure

News Information Extraction for Web Resource

Learning to Extract Web News Title in Template Independent Way

Method of Collecting and Analyzing News Pages on Internet

Information extraction for Web resource

NLP based intelligent news search engine using information extraction from e-newspapers

Research of Vertical Search Engine in News Industry

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Relevant Data Node Extraction: A Web Data Extraction Method for Non Contagious Data