A novel focused crawler based on breadcrumb navigation

Lizhi Ying,Xinhao Zhou,Jian Yuan,Yongfeng Huang
DOI: https://doi.org/10.1007/978-3-642-31020-1_31
2012-01-01
Abstract:In this paper, a novel focused crawler based on Breadcrumb Navigation (BN) is proposed. It mainly leverages Breadcrumb Navigation in the webpages to reconstruct the website structures and resolve focused crawling problems. Different from some previous focused crawlers which use prediction models, the BN crawler firstly samples the web to construct the semantic forest for websites based on Breadcrumb Navigation, and then searches the forest to find the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages belonging to the relevant sub-forest. By using this method, the BN crawler costs less time to analyze the Webpage-to-Topic (W2T) similarity but results in a highly efficient performance. The experimental evidences show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and can be widely used for most websites.
What problem does this paper attempt to address?