A boosted semi-supervised learning framework for web page filtering

Zhu He, Xi Li, Weiming Hu
DOI: https://doi.org/10.1109/ICSMC.2009.5346290
2009-01-01
Abstract: The World Wide Web makes it convenient for users to obtain information. However, the Internet also carries much harmful content, such as pornography and information about prohibited drugs, so filtering harmful Web pages is an important problem. In general, harmful Web page filtering is cast as Web page classification, which requires many well-labeled training samples. However, labeling a large set of Web pages is very expensive. To address this problem, we adopt a semi-supervised framework for Web page filtering. In this framework, each Web page is represented by bags of different features, extracted using its HTML structure. A semi-supervised learning strategy is then used to obtain well-labeled training samples efficiently. Finally, a boosting classifier is used for harmful Web page filtering. Experiments demonstrate the effectiveness of our framework.
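The three stages named in the abstract (structure-based feature bags, semi-supervised labeling, boosting) can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: the three HTML regions (title, anchor text, remaining text), plain self-training as the semi-supervised strategy, AdaBoost over token-presence stumps as the boosting classifier, and all function names, thresholds, and toy pages.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class BagExtractor(HTMLParser):
    """Collect one bag of words per HTML region (the regions are assumed)."""

    def __init__(self):
        super().__init__()
        self.bags = {"title": Counter(), "anchor": Counter(), "body": Counter()}
        self._open = []  # stack of currently open <title>/<a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        region = "body"
        if self._open:
            region = "title" if self._open[-1] == "title" else "anchor"
        self.bags[region].update(re.findall(r"[a-z]+", data.lower()))


def featurize(html):
    """Return the set of 'region:word' tokens present in a page."""
    parser = BagExtractor()
    parser.feed(html)
    parser.close()
    return {f"{r}:{w}" for r, bag in parser.bags.items() for w in bag}


def stump(token, polarity, x):
    """Weak learner: vote `polarity` if the token is present, else the opposite."""
    return polarity if token in x else -polarity


def train_adaboost(samples, labels, rounds=5):
    """AdaBoost over token-presence stumps; labels are +1 (harmful) or -1."""
    n = len(samples)
    weights = [1.0 / n] * n
    vocab = sorted(set().union(*samples))
    model = []
    for _ in range(rounds):
        best = None  # (weighted error, token, polarity)
        for token in vocab:
            for pol in (1, -1):
                err = sum(w for w, x, y in zip(weights, samples, labels)
                          if stump(token, pol, x) != y)
                if best is None or err < best[0]:
                    best = (err, token, pol)
        err, token, pol = best
        err = min(max(err, 1e-9), 1 - 1e-9)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, token, pol))
        # Re-weight so that misclassified samples count more in the next round.
        weights = [w * math.exp(-alpha * y * stump(token, pol, x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def margin(model, x):
    """Normalized ensemble vote in [-1, 1]; positive means 'harmful'."""
    total = sum(a for a, _, _ in model)
    return sum(a * stump(t, p, x) for a, t, p in model) / total


def self_train(labeled, unlabeled, threshold=0.5, max_iters=3):
    """Self-training: move confidently scored pages into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iters):
        if not pool:
            break
        model = train_adaboost([x for x, _ in labeled], [y for _, y in labeled])
        scored = [(x, margin(model, x)) for x in pool]
        confident = [(x, 1 if m > 0 else -1) for x, m in scored if abs(m) >= threshold]
        if not confident:
            break
        labeled += confident
        pool = [x for x, m in scored if abs(m) < threshold]
    return train_adaboost([x for x, _ in labeled], [y for _, y in labeled])


# Toy data, invented for this sketch only.
labeled = [(featurize("<title>cheap pills</title>"), 1),
           (featurize("<title>pills online</title>"), 1),
           (featurize("<title>cheap flights</title>"), -1),
           (featurize("<title>news online</title>"), -1)]
unlabeled = [featurize("<title>pills offer</title>"),
             featurize("<title>weather news</title>")]
model = self_train(labeled, unlabeled)
```

A page is then filtered when `margin(model, featurize(html))` is positive. Prefixing each token with its region keeps the bags separate, so a word in a title and the same word in anchor text become different features.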