A novel incremental parallel web crawler based on focused crawling
Qiuyan Huang, Qingzhong Li, Zhongmin Yan, Hong Fu
Journal of Computational Information Systems
Abstract: With the tremendous growth of the Web, general-purpose single-process crawlers struggle to locate precise and relevant resources in a reasonable amount of time, so more effective algorithms are needed. This paper proposes a novel incremental parallel Web crawler based on focused crawling, which concurrently crawls Web pages relevant to multiple predefined topics. To solve the URL distribution problem, a compound decision model based on multi-objective decision making is introduced, which jointly considers multiple factors such as load balance and relevance. To decide how often the local repository should be updated, an update frequency graph model is presented, in which the graph is constructed dynamically according to the update frequency of Web pages. Extensive experiments show that the proposed system can efficiently acquire Web information of high quality, high relevance, and high freshness. Copyright © 2013 Binary Information Press.
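The abstract does not give the form of the compound decision model, so the following is only an illustrative sketch of the general idea of distributing a URL among parallel focused crawlers by weighing relevance against load balance. It assumes a simple weighted-sum scoring, which is one common multi-objective technique and not necessarily the one used in the paper. The names Crawler, relevance, load_score, and assign, the Jaccard relevance measure, and the weights 0.7 and 0.3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Crawler:
    """One crawling process dedicated to a single predefined topic (hypothetical)."""
    name: str
    topic_terms: frozenset
    queue_length: int = 0


def relevance(url_terms, crawler):
    """Jaccard overlap between the URL's terms and the crawler's topic terms (assumed measure)."""
    union = url_terms | crawler.topic_terms
    return len(url_terms & crawler.topic_terms) / len(union) if union else 0.0


def load_score(crawler, max_queue):
    """1.0 for an empty queue, 0.0 for the longest queue among the crawlers."""
    return 1.0 - crawler.queue_length / max_queue if max_queue else 1.0


def assign(url_terms, crawlers, w_rel=0.7, w_load=0.3):
    """Give the URL to the crawler with the best weighted score (weights are assumptions)."""
    max_queue = max(c.queue_length for c in crawlers)
    best = max(
        crawlers,
        key=lambda c: w_rel * relevance(url_terms, c) + w_load * load_score(c, max_queue),
    )
    best.queue_length += 1  # the chosen crawler now has one more URL pending
    return best
```

In this sketch, a URL whose terms match one topic goes to that topic's crawler, while among equally relevant crawlers the one with the shorter queue is preferred.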