Abstract: Materials websites hold rich data resources, most of which are presented as HTML tables. However, because HTML tables are semi-structured, it is difficult to distinguish attributes from values. Identifying attributes in HTML tables is therefore the key issue for information acquisition. In this paper, we propose a method based on sibling comparison for extracting materials knowledge from HTML tables. It consists of three steps: acquiring sibling tables, identifying the table pattern, and extracting the table data. We show how to use the F-measure to find appropriate thresholds for matching tables from materials websites when acquiring sibling tables. Further, we propose a strategy named FRFC (First Row matching and First Column matching) to distinguish attributes from values, so that the table pattern is identified. The data in HTML tables are then extracted according to their corresponding table patterns and mapped to a predefined schema, which facilitates populating a materials ontology. The proposed approach is applicable to cases where an attribute may span multiple cells and where more attributes match across sibling tables. Using FRFC to identify table patterns achieves the desired accuracy (> 90%). Extraction time may not increase significantly as the number of documents and the number of cells in tables grow, so the approach can efficiently process a large number of documents. A prototype named MTES has been developed and demonstrates the effectiveness of the proposed approach.
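
The threshold selection mentioned above can be illustrated with a minimal sketch. It assumes that each candidate pair of tables already has a similarity score in [0, 1] and a hand-assigned label saying whether the two tables are siblings. The abstract does not specify the similarity measure or the candidate thresholds, so the names `f_measure`, `best_threshold`, and the sample data are hypothetical and are not the authors' implementation.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, candidates):
    """Return (threshold, F) for the candidate with the highest F-measure.

    scored_pairs: iterable of (similarity, is_sibling) tuples, where
    is_sibling is the manually assigned ground truth (an assumption here).
    A pair is predicted to be a sibling pair when similarity >= threshold.
    """
    pairs = list(scored_pairs)
    best_t, best_f = None, -1.0
    for t in candidates:
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_measure(precision, recall)
        if f > best_f:  # keep the first threshold that reaches the best F
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical labelled sample: (similarity score, are the tables siblings?)
sample_pairs = [
    (0.95, True), (0.90, True), (0.70, True),
    (0.75, False), (0.65, False), (0.40, False), (0.30, False),
]
threshold, score = best_threshold(sample_pairs, [0.2, 0.5, 0.68, 0.8])
print(threshold, round(score, 3))
```

Sweeping a small set of candidate thresholds and keeping the one with the highest F-measure balances wrongly matched tables (low precision) against missed siblings (low recall). In practice the candidates would be chosen from the observed score range.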

Study on Tables Information Extraction Based on Web

Extracting Information from Ontology-based WEB Table

Extracting information from WEB tables based on abstract semantic model

The Web information extraction technology research based on XML description

Automatically Extracting Local Ontologies Via HTML Tables

Research on Automatic Extraction Technology of Web Information

Web mining of relations from XML and construct database schema

Extraction and integration information in HTML tables

Web Table Extraction, Retrieval and Augmentation: A Survey

Extracting Web table information in cooperative learning activities based on abstract semantic model

Automating the extraction of data from HTML tables with unknown structure

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Automatically Extraction of Semantic Hierarchical Structures from HTML Tables

Location Technology of Non-standardized Table Based on DOM Tree

A human-machine method for web table understanding

Research on Key Technologies of Web Mining Based on XML

An ontology-based Web information extraction approach

The Study of the Web Information Extraction System Based on Ontology

Ontology-Based Two-Phase Semi-Automatic Web Extracting

Managing Knowledge on the Web - Extracting Ontology from Html Web

A Method for Materials Knowledge Extraction from HTML Tables Based on Sibling Comparison