Classifier-Guided Topical Crawler: A Novel Method of Automatically Labeling the Positive URLs

Chen Li, Li Zhi-shu, Yu Zhong-hua, Han Guo-hui
DOI: https://doi.org/10.1109/skg.2009.60
2009-01-01
Abstract: Obtaining labeled training samples is a key factor for a classifier-guided topical crawler. Many such classifiers have been trained with web pages that are labeled manually or extracted from the Open Directory Project (ODP); the classifiers then judge the topical relevance of the web pages pointed to by hyperlinks in the crawler frontier. Although labeled web pages are comparatively easy to obtain, training the classifiers on web pages violates the basic machine learning assumption that training and test sets are i.i.d. (independent and identically distributed), because the instances to be classified are hyperlinks (URLs) rather than web pages. For this reason, this paper proposes a novel template-based method for automatically labeling positive URLs in order to develop classifier-guided topical crawlers. An extensive series of off-line and on-line experiments is performed. The results demonstrate that a classifier-guided topical crawler trained with labeled URLs achieves higher recall than one trained with labeled web pages. The results also show that a classifier using the immediate vicinity of hyperlinks and the corresponding anchor texts leads the crawler to attain a harvest rate of about 95%.
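To make the distinction between URL instances and web page instances concrete, the following is a minimal sketch of how a hyperlink might be turned into a classification instance from its anchor text and its immediate vicinity. It does not reproduce the paper's template-based labeling method, which the abstract does not describe. The names `LinkContextParser`, `url_instances`, `harvest_rate`, and the `window` parameter are hypothetical, and harvest rate is assumed here to mean the fraction of fetched pages that are relevant, which is its usual definition in the focused-crawling literature.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page words and remembers where each hyperlink starts and ends."""

    def __init__(self):
        super().__init__()
        self.tokens = []    # all whitespace-separated words seen so far
        self.links = []     # (href, start_index, end_index) into self.tokens
        self._open = None   # (href, start_index) of the link currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        self.tokens.extend(data.split())


def url_instances(html, window=5):
    """Return one instance per hyperlink: URL, anchor words, nearby words.

    `window` (an assumption of this sketch) is the number of words kept
    on each side of the anchor text as the "immediate vicinity".
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    instances = []
    for href, start, end in parser.links:
        before = parser.tokens[max(0, start - window):start]
        after = parser.tokens[end:end + window]
        instances.append({
            "url": href,
            "anchor": parser.tokens[start:end],
            "vicinity": before + after,
        })
    return instances


def harvest_rate(relevance_flags):
    """Fraction of fetched pages judged relevant (assumed definition)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

A classifier trained on such instances sees, at training time, the same kind of input it must score in the crawler frontier, which is the i.i.d. concern the abstract raises.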