Measuring Similarity of Web Pages on Maximum Isomorphic Subtree

Zhenyu Hu, Fuchun Sun
DOI: https://doi.org/10.1109/fskd.2010.5569792
2010-01-01
Abstract: This paper studies the problem of comparing and searching for structured data in DOM trees. It proposes the notion of a structure descriptor for ordered trees, which fully represents the structural information of a DOM tree in serialized form and yields an efficient method for converting a DOM tree into its node sequence. Based on this notion, the paper presents an algorithm that measures the similarity of two web pages by searching for maximum isomorphic subtrees in the serialized node sequences. When comparing two web pages, the algorithm has time complexity O(n^2); when searching a web page for a given structured object, its complexity is O(n). Experimental results on a number of well-known web pages from diverse domains show that the proposed technique identifies similar structured objects very accurately.
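The idea of serializing an ordered tree and matching subtrees inside the resulting node sequences can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the structure descriptor, so the `(tag, subtree_size)` encoding, the exact tag-and-shape matching rule, the `similarity` normalization, and all function names here are assumptions made for illustration. The sketch also makes no attempt to reproduce the stated O(n^2) and O(n) bounds.

```python
from typing import List, Tuple

# Assumed tree shape: (tag, ordered list of child trees).
Seq = List[Tuple[str, int]]


def serialize(tree) -> Seq:
    """Preorder node sequence of (tag, subtree_size) pairs.

    The subtree rooted at position i occupies seq[i : i + size], so the
    sequence alone is enough to recover the ordered tree's structure.
    """
    seq: Seq = []

    def visit(node) -> None:
        tag, children = node
        pos = len(seq)
        seq.append((tag, 0))  # placeholder until the subtree size is known
        for child in children:
            visit(child)
        seq[pos] = (tag, len(seq) - pos)

    visit(tree)
    return seq


def largest_common_subtree(a: Seq, b: Seq) -> int:
    """Node count of the largest subtree that occurs in both sequences.

    Two subtrees are treated as isomorphic when their serialized slices
    are identical (same tags, same shape, same child order). This
    matching rule is an assumption of the sketch.
    """
    subtrees_a = {tuple(a[i:i + size]) for i, (_, size) in enumerate(a)}
    best = 0
    for j, (_, size) in enumerate(b):
        if size > best and tuple(b[j:j + size]) in subtrees_a:
            best = size
    return best


def similarity(a: Seq, b: Seq) -> float:
    """Hypothetical score in [0, 1]: shared subtree size over mean page size."""
    return 2 * largest_common_subtree(a, b) / (len(a) + len(b))


page_a = ("html", [("body", [("ul", [("li", []), ("li", [])]), ("p", [])])])
page_b = ("html", [("body", [("div", []), ("ul", [("li", []), ("li", [])])])])

seq_a, seq_b = serialize(page_a), serialize(page_b)
print(seq_a)                     # [('html', 6), ('body', 5), ('ul', 3), ('li', 1), ('li', 1), ('p', 1)]
print(similarity(seq_a, seq_b))  # 0.5 (the shared <ul> subtree has 3 of 6 nodes)
```

Storing the subtree size next to each tag means a subtree comparison reduces to a slice comparison on the flat sequence, which is the kind of benefit a serialized representation is meant to provide.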