Efficient Focused Crawling Strategy Using Combination of Link Structure and Content Similarity

Qu Cheng, Wang Beizhan, Wei Pianpian
DOI: https://doi.org/10.1109/itme.2008.4744029
2008-01-01
Abstract: At present, focused crawlers usually crawl pages using either the link structure or the page content, but each approach has flaws. We therefore designed an efficient crawling strategy that combines link structure with content similarity. We extract the topic feature vector automatically and judge the topic similarity of a page using a combination of link structure and page content. We also predict URL similarity using the link structure in topic pages. Experiments showed that this strategy effectively increases the precision of fetching topic pages.
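The abstract does not give the scoring formulas, so the following is only a minimal Python sketch of the general idea of blending content similarity with link-based evidence when prioritizing URLs. All names here (cosine_similarity, combined_score, Frontier, the weight alpha, and the use of the mean score of linking pages as the link signal) are illustrative assumptions, not the authors' method.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(topic_vector, text, parent_scores, alpha=0.5):
    """Blend content similarity with a link-based score.

    parent_scores holds the scores of already-crawled pages that link to
    this URL; their mean stands in for link-structure evidence. Both this
    choice and the weight alpha are assumptions made for illustration.
    """
    content = cosine_similarity(topic_vector, Counter(text.lower().split()))
    link = sum(parent_scores) / len(parent_scores) if parent_scores else 0.0
    return alpha * content + (1 - alpha) * link


class Frontier:
    """Priority queue of URLs, highest predicted score first."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In this sketch a crawler would score each newly discovered URL with combined_score, push it onto the Frontier, and always fetch the highest-scoring URL next, so pages that look on-topic by both signals are visited first.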