Template extraction from candidate template set generation: a structure and content approach.

Hang Su, Qiaozhu Mei
DOI: https://doi.org/10.1145/1167253.1167303
2005-01-01
Abstract: This paper introduces a new approach to webpage template extraction. Unlike traditional methods, which consider only content information, it considers both structural and content similarity. It uses natural table structures as content units instead of text blocks or pagelets. The paper formally defines templates and related concepts. It introduces a new concept, the candidate template, which is an intermediate-level abstraction of table structure. A candidate template covers only the most informative tables and abstracts a large set of pages with similar structures. The paper proposes a novel approach to template extraction that solves three subproblems centered on the candidate template set. Introducing the candidate template set addresses the accuracy and efficiency problems of traditional approaches. The paper also introduces new models for structural similarity and for table informativeness based on six heuristics.