Extraction and integration information in HTML tables

Shijun Li, Zhiyong Peng, Mengchi Liu
DOI: https://doi.org/10.1109/CIT.2004.1357214
2004-01-01
Abstract: A large amount of the information available on the Web is formatted in HTML tables, which are mainly presentation-oriented and not suited for database applications. As a result, capturing the information in HTML tables semantically and integrating the relevant information is a challenge. In this paper, we present a new approach that automatically captures the semantic hierarchies of HTML tables and semi-automatically integrates HTML tables. It first automatically captures the attribute-value pairs in HTML tables by normalization, and introduces the notion of an eigenvalue in formatting information to recognize the headings of HTML tables. After the global concepts and the global schema are generated manually by defining what data is to be integrated, it learns the lexical semantic set and the contexts for each global concept by labelling the attributes of example HTML tables with their corresponding global concepts. Finally, it integrates the data of each source HTML table, using the lexical semantic sets and the contexts to eliminate conflicts and resolve the nondeterministic problems that arise in mapping each source schema to the global schema.
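To make the pipeline in the abstract concrete, the following is a minimal sketch of two of its steps: turning a flat HTML table into attribute-value pairs, and mapping source attributes to global concepts through lexical sets. It is not the authors' algorithm. It assumes the first row holds the headings (the paper instead detects headings from formatting information), it ignores nested tables, spanning cells, and contexts, and the names `TableRows`, `attribute_value_pairs`, `to_global`, `LEXICAL_SETS`, and `SAMPLE` are all hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text row by row from flat (non-nested) table markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def attribute_value_pairs(html):
    """Assumption: the first row is the heading row; pair later cells with it."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    headings, *body = parser.rows
    return [dict(zip(headings, row)) for row in body]


# Hypothetical lexical sets: global concept -> source attribute names seen so far.
LEXICAL_SETS = {
    "title": {"title", "book", "name"},
    "price": {"price", "cost"},
}


def to_global(record, lexical_sets=LEXICAL_SETS):
    """Rename source attributes to global concepts; unmatched ones are dropped."""
    out = {}
    for attr, value in record.items():
        for concept, names in lexical_sets.items():
            if attr.lower() in names:
                out[concept] = value
                break
    return out


SAMPLE = (
    "<table><tr><th>Book</th><th>Cost</th></tr>"
    "<tr><td>Dune</td><td>9.99</td></tr></table>"
)
records = [to_global(r) for r in attribute_value_pairs(SAMPLE)]
```

Here `records` holds one dictionary keyed by the global concepts `title` and `price`. A source attribute that matches several concepts is exactly the nondeterministic case the abstract mentions; this sketch simply takes the first match, whereas the paper uses contexts to decide.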