Detecting and Monitoring Dynamic Content Blocks of a Web Page by Merging its Historical Versions

Shu Tang, Zhicheng Dou, Xing Xie, Jun He
2014-01-01
Abstract: Nowadays, most people and organizations with websites design their own homepages to help readers find information about them. The content of these homepages is usually divided into areas, each of which contains information about one specific aspect. The information in some of these areas is updated over time. It would be very convenient for visitors to a site if these dynamic information areas could be detected automatically and their content traced. Previous studies have paid little attention to homepages, and have neither made full use of pages’ historical information nor explored pages along the temporal dimension. We build a merged tree from the historical versions of a page, and then use it to detect dynamic content blocks and to extract and trace their content. Experimental results on a large number of Web pages from diverse domains show that the proposed technique extracts dynamic content blocks with a high level of accuracy.