Discovering Informative Contents of Web Pages.

Qifeng Fan, Chunwei Yan, Lifu Huang, Lian'en Huang
DOI: https://doi.org/10.1007/978-3-319-08010-9_20
2014-01-01
Abstract: The World Wide Web has become a huge information repository. However, besides informative contents, Web pages also contain redundant contents, which are considered harmful to Web mining and search systems. In this paper, we propose a new approach to discovering informative contents from a set of Web pages within a single Web site. Our method works as follows. First, we propose a newly designed Site Style Tree (SST) to capture the common presentation styles and the actual contents of the pages in the given Web site. The tree structure, which differs from the one formerly proposed, is built by aligning pages of the site. Then, for each node of the SST, informative contents are discovered based on entropy and a threshold method. The proposed approach is evaluated on two mining tasks, Web page clustering and classification. The experimental results show a significant improvement over previous template detection approaches.
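The abstract does not give the entropy formula or the threshold, so the following is only a minimal sketch of how an entropy-and-threshold test on a single tree node might look. It assumes that each node collects the text found at the same aligned position across the pages of a site, that Shannon entropy normalized by the number of pages is used, and that a node counts as informative when this value reaches a cutoff. The function names, the normalization, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(contents):
    """Shannon entropy of the content values seen at one node, scaled to [0, 1].

    `contents` holds one string per page: the text found at this node's
    position in that page. Normalizing by log(number of pages) is an
    assumption made for this sketch, not the paper's definition.
    """
    m = len(contents)
    if m <= 1:
        return 0.0  # a single page carries no evidence of variation
    counts = Counter(contents)
    h = -sum((c / m) * math.log(c / m) for c in counts.values())
    return h / math.log(m)


def is_informative(contents, threshold=0.5):
    """Hypothetical threshold rule: high variation across pages = informative."""
    return normalized_entropy(contents) >= threshold


# Illustrative data only: a navigation bar repeats on every page,
# while the article body differs from page to page.
nav_bar = ["Home | About | Contact"] * 4
article_body = ["Story A", "Story B", "Story C", "Story D"]
```

Under these assumptions, a node whose content is identical on every page (such as `nav_bar`) has entropy 0 and is treated as template material, while a node whose content differs on every page (such as `article_body`) has normalized entropy 1 and is kept as informative.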