Template detection for large scale search engines.

Liang Chen, Shaozhi Ye, Xing Li
DOI: https://doi.org/10.1145/1141277.1141534
2006-01-01
Abstract: Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from limited processing speed and scalability when dealing with large volumes of web pages. In this paper, we propose a novel two-stage template detection method, which combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks, and the blocks are clustered according to their style features. Second, similar contents sharing a common layout style are detected during the index building process. Blocks with similar layout style and content are identified as templates and deleted. Our experiment on eight popular web sites shows that our method runs 20-40% faster than the shingle and SST methods with comparable accuracy.
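The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the block representation (dicts with `tag_path`, `css_class`, and `text`), the helper names `style_key`, `content_fingerprint`, and `detect_templates`, and the `min_pages` threshold are all hypothetical. The sketch also simplifies "similar contents" to an exact match after whitespace and case normalization, and it does not model the integration with index building.

```python
from collections import defaultdict
import hashlib


def style_key(block):
    # Hypothetical style signature: DOM tag path plus CSS class.
    # The paper's actual style features are not given in the abstract.
    return (block["tag_path"], block.get("css_class", ""))


def content_fingerprint(text):
    # Simplification: exact match after lowercasing and collapsing
    # whitespace, standing in for a real content-similarity measure.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def detect_templates(pages, min_pages=2):
    """Return the set of (page_id, block_index) pairs flagged as template.

    pages maps a page id to a list of block dicts. A block is flagged
    when another block with the same style and the same normalized
    content occurs on at least min_pages distinct pages.
    """
    # Stage 1: cluster blocks by layout style.
    clusters = defaultdict(list)
    for page_id, blocks in pages.items():
        for index, block in enumerate(blocks):
            clusters[style_key(block)].append((page_id, index, block["text"]))

    # Stage 2: within each style cluster, look for repeated content.
    templates = set()
    for members in clusters.values():
        by_content = defaultdict(list)
        for page_id, index, text in members:
            by_content[content_fingerprint(text)].append((page_id, index))
        for occurrences in by_content.values():
            distinct_pages = {page_id for page_id, _ in occurrences}
            if len(distinct_pages) >= min_pages:
                templates.update(occurrences)
    return templates
```

Under these assumptions, a navigation bar that appears with the same style on several pages is flagged, while a block with the same text but a different style is left alone, because content is only compared within a style cluster.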