Automatic Extraction of Semantic Hierarchical Structures from HTML Tables

Tiaojun Xiao
2007-01-01
Abstract: Existing approaches for extracting information from HyperText Markup Language (HTML) tables cannot process complicated or nested tables. This paper presents an approach for extracting semantic hierarchical structures from complex HTML tables. The approach is based on four basic types of tables and uses a content tree to depict the semantic hierarchical structure of an HTML table. It differentiates the attribute cells from the value cells, divides the HTML table into basic tables, and then constructs content trees to extract the semantic hierarchical structure. Tests demonstrate that the approach can automatically analyze complex, nested tables with accurate results.
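
To make the attribute-cell/value-cell distinction concrete, here is a minimal Python sketch for a single flat table. It is not the paper's algorithm: the abstract does not define the four basic table types or the cell classifier, so the sketch simply assumes that `<th>` cells are attribute cells, `<td>` cells are value cells, and the first row holds the attributes. The names `CellCollector` and `content_tree` are hypothetical, and nested tables, `rowspan`, and `colspan` are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect cells row by row as (kind, text) pairs.

    Assumption: <th> marks an attribute cell and <td> marks a value cell.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of (kind, text) pairs per <tr>
        self._kind = None   # kind of the cell currently open, if any
        self._text = []     # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._kind = "attr" if tag == "th" else "value"
            self._text = []

    def handle_data(self, data):
        if self._kind is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._kind is not None:
            self.rows[-1].append((self._kind, "".join(self._text).strip()))
            self._kind = None


def content_tree(html):
    """Build a shallow tree {attribute: [values...]} from a flat table.

    Assumption: the first row contains the attribute cells and every
    later row contains one value cell per attribute, in the same order.
    """
    parser = CellCollector()
    parser.feed(html)
    header, *body = parser.rows
    attributes = [text for kind, text in header if kind == "attr"]
    tree = {name: [] for name in attributes}
    for row in body:
        values = [text for kind, text in row if kind == "value"]
        for name, value in zip(attributes, values):
            tree[name].append(value)
    return tree


TABLE = (
    "<table>"
    "<tr><th>Name</th><th>Year</th></tr>"
    "<tr><td>A</td><td>2006</td></tr>"
    "<tr><td>B</td><td>2007</td></tr>"
    "</table>"
)
print(content_tree(TABLE))
```

Each attribute becomes a parent node with its values as children. Under the same assumptions, a nested or complex table would first have to be split into flat pieces like this one before the trees could be combined.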