Using XPath to Discover Informative Content Blocks of Web Pages

Yan Fu, Dongqing Yang, Shiwei Tang, Tengjiao Wang, Jun Gao
DOI: https://doi.org/10.1109/skg.2007.106
2007-01-01
Abstract: Web pages usually contain various contents, some relevant and some irrelevant to the main topic. We define relevant contents as informative content blocks and irrelevant contents as clutter. Clutter can mislead search engines or trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical for improving the performance of traditional information retrieval. Here, we propose a method to discover informative content blocks without supervision. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. Then we generalize a common XPath for the informative content blocks or the clutter, and apply it to similar pages. We have applied our method to five different Web sites, where it outputs simpler and more focused HTML files. Experimental results show that our method can obtain the informative content blocks of a Web page accurately. Another advantage of our approach is that it is completely automatic.
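The two-stage idea in the abstract (label blocks on sample pages, then generalize an XPath and reuse it on similar pages) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the abstract does not state the rules, so `pick_informative` uses a stand-in heuristic (the `div` with the most non-link text), `generalize` simply drops positional indices that disagree between samples, and the sample pages are invented, well-formed XHTML so that Python's standard `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET

def text_len(el):
    """Length of all text inside an element, including descendants."""
    return len("".join(el.itertext()).strip())

def pick_informative(root):
    """Stand-in rule (an assumption): the div with the most non-link text."""
    def score(div):
        link_text = sum(text_len(a) for a in div.iter("a"))
        return text_len(div) - link_text
    return max(root.iter("div"), key=score)

def block_path(root, target):
    """Build an indexed path such as './body[1]/div[2]' from root to target."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps))

def generalize(paths):
    """Keep steps shared by all sample paths; drop indices that disagree."""
    split = [p.split("/") for p in paths]
    if len({len(s) for s in split}) != 1:
        return None  # paths of different depth: no common template here
    out = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            out.append(steps[0])
        else:
            tags = {s.split("[")[0] for s in steps}
            if len(tags) != 1:
                return None  # different tags at the same depth
            out.append(tags.pop())
    return "/".join(out)

def extract(page, xpath):
    """Apply the generalized path to a page and return the block's text."""
    block = ET.fromstring(page).find(xpath)
    return "".join(block.itertext()).strip()

# Invented sample pages sharing one template: nav, article, advert.
TEMPLATE = ("<html><body><div><a href='/'>Home</a><a href='/n'>News</a></div>"
            "<div><p>{}</p></div><div><a href='/ad'>Buy now</a></div>"
            "</body></html>")
PAGE_A = TEMPLATE.format("A long article about the first topic of the site.")
PAGE_B = TEMPLATE.format("Another long article covering a second topic.")
PAGE_C = TEMPLATE.format("Unseen page text.")

sample_paths = []
for page in (PAGE_A, PAGE_B):
    root = ET.fromstring(page)
    sample_paths.append(block_path(root, pick_informative(root)))

xpath = generalize(sample_paths)
print(xpath)
print(extract(PAGE_C, xpath))
```

Real Web pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser in front of this step; the sketch only shows how a path learned from sample pages can be reused on an unseen page from the same template.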