Abstract: Tabular data is an abundant source of information on the Web, but it remains mostly isolated from the Web's link structure because tables lack links and machine-readable descriptions of their structure. In other words, the schemas of these tables (attribute names, values, data types, etc.) are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines, and thus not accessible to user search queries. We address this lack of structure with a new method that leverages the principles of table construction to extract table schemas. The method discovers the schema by which a table is constructed by harnessing the similarities and differences of nearby table rows, using a novel set of features and a feature processing scheme. Schemas are determined using a classification technique based on conditional random fields, in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, and can therefore handle general tables such as those found in spreadsheets, rather than being restricted to HTML tables as the WebTables method is. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
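To make the idea of a logarithmic feature encoding more concrete, the following is a minimal sketch of how per-row cell counts might be mapped to log-scale bins before being passed to a sequence classifier such as a conditional random field. The abstract does not give the actual formula, so the binning rule (`log_bin`), the numeric-cell test (`is_numeric`), and the row encoder (`encode_row`) are all hypothetical names and assumptions chosen for illustration, not the authors' definitions.

```python
from typing import Callable, Sequence, Tuple


def log_bin(count: int) -> int:
    """Map a non-negative count to a logarithmic bin (assumed rule).

    0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, ...
    For count > 0 this equals floor(log2(count)) + 1, computed with
    integer arithmetic to avoid floating-point rounding.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return count.bit_length()


def is_numeric(cell: str) -> bool:
    """Hypothetical cell-level feature: does the cell parse as a number?"""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def encode_row(cells: Sequence[str],
               predicate: Callable[[str], bool]) -> Tuple[int, int]:
    """Encode one row as (bin of matching cells, bin of non-matching cells).

    Binning both counts lets rows of different widths with similar
    proportions share the same discrete feature value.
    """
    matching = sum(1 for cell in cells if predicate(cell))
    return log_bin(matching), log_bin(len(cells) - matching)


header_row = ["Year", "Population", "Area"]
data_row = ["2010", "1,234", "56.7"]

header_code = encode_row(header_row, is_numeric)
data_code = encode_row(data_row, is_numeric)
print(header_code, data_code)
```

Under these assumptions, a header row and a data row of the same width receive different discrete codes, which is the kind of row-to-row contrast a sequence classifier could exploit when labelling rows.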

Schema extraction for tabular data on the web
