Abstract: Tabular data is an abundant source of information on the Web, but it remains largely isolated from the Web's interconnections because tables lack links and machine-readable descriptions of their structure. In other words, the schemas of these tables (attribute names, values, data types, etc.) are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines, and thus not to user search queries. We address this lack of structure with a new method that leverages the principles of table construction to extract table schemas. The method discovers the schema by which a table is constructed by exploiting the similarities and differences between nearby table rows, using a novel set of features and a feature processing scheme. Schemas are determined by a classifier based on conditional random fields, combined with a novel feature encoding method called logarithmic binning that is designed specifically for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, and can therefore handle general tables such as those found in spreadsheets, rather than being restricted to HTML tables as the WebTables method is. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
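
The abstract names logarithmic binning but does not define it, so the following is only an illustrative sketch of the general idea: per-row counts of cells that exhibit some feature are collapsed into coarse, log-scaled bins so that rows of different widths yield comparable discrete labels for a sequence classifier. The function names `log_bin`, `encode_row_feature`, and `is_numeric`, the base-2 bin boundaries, and the `"a+/b-"` label format are all assumptions made for this example and are not taken from the paper.

```python
def log_bin(count: int) -> int:
    """Map a non-negative count to a base-2 logarithmic bin index.

    Assumed binning: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, ...
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # For count > 0, bit_length() equals floor(log2(count)) + 1.
    return count.bit_length()


def is_numeric(cell: str) -> bool:
    """Hypothetical cell-level feature: does the cell parse as a number?"""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def encode_row_feature(cells: list[str], predicate) -> str:
    """Encode how many cells in a row satisfy `predicate` as a binned label.

    Both the number of matching cells and the number of non-matching
    cells are binned, so the label reflects the proportion as well as
    the magnitude.
    """
    matches = sum(1 for cell in cells if predicate(cell))
    misses = len(cells) - matches
    return f"{log_bin(matches)}+/{log_bin(misses)}-"


header_row = ["Year", "Sales", "Profit"]
data_row = ["2010", "1.5", "0.3"]
print(encode_row_feature(header_row, is_numeric))  # label for a header-like row
print(encode_row_feature(data_row, is_numeric))    # label for a data-like row
```

Under these assumptions, a header-like row and a data-like row receive different labels for the same feature, which is the kind of row-to-row contrast that a conditional random field over the row sequence could exploit.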

Extraction and integration information in HTML tables

Capturing Semantic Hierarchies to Perform Meaningful Integration in HTML Tables

Automatically Extraction of Semantic Hierarchical Structures from HTML Tables

Study on Tables Information Extraction Based on Web

Automatically Extracting Local Ontologies Via HTML Tables

Automating the extraction of data from HTML tables with unknown structure

Extracting information from WEB tables based on abstract semantic model

Extracting Web table information in cooperative learning activities based on abstract semantic model

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Extracting Linked Data from HTML Tables.

A Method for Materials Knowledge Extraction from HTML Tables Based on Sibling Comparison

Analysis and Interpretation of Semantic HTML Tables

Data Extraction from the Web Based on Pre-Defined Schema

Automatic Deep Web Table Segmentation By Domain Ontology

A human-machine method for web table understanding

Managing Knowledge on the Web - Extracting Ontology from Html Web

Schema extraction for tabular data on the web

Extracting Information from Ontology-based WEB Table

Ontology-based HTML to XML conversion

An approach for deep web interface schema extraction based on hierarchical semantic annotation

Location Technology of Non-standardized Table Based on DOM Tree