System Model of Incremental Spider for the Chinese Web and Its Implementation

MENG Tao, YAN Hongfei, WANG Jimin
DOI: https://doi.org/10.3321/j.issn:1000-0054.2005.09.034
2005-01-01
Abstract: This paper addresses efficient incremental collection of information from the Chinese web. Experiments were first designed and carried out to examine how pages evolve over a short period. Based on the results, a general system model for incremental spiders was established. Potential performance bottlenecks in its implementation were then analyzed in depth, and corresponding solutions are given. In addition, two approaches are proposed for efficiently collecting updated and newly created pages within this model: exploiting temporal locality in page changes to catch updated pages, and using index pages to discover new ones. The model and its implementation strategies have been successfully applied in the TianWang system and are also relevant to other similar systems.
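The two collection strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "temporal locality" means pages observed to change recently are likely to change again soon, and that an index page is a page whose outgoing links point to newly published pages. The names `PageRecord`, `revisit_order`, and `discover_new` are hypothetical and do not come from the paper or from the TianWang system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PageRecord:
    """Hypothetical per-page crawl state (not from the paper)."""
    url: str
    # Time of the most recent crawl that detected a change; None if never seen to change.
    last_change_seen: Optional[float]


def revisit_order(records: Iterable[PageRecord], now: float) -> List[str]:
    """Order URLs so that the most recently changed pages are revisited first.

    Assumption: a recent change predicts another change soon (temporal locality).
    Pages never seen to change are placed last.
    """
    def time_since_change(record: PageRecord) -> float:
        if record.last_change_seen is None:
            return float("inf")
        return now - record.last_change_seen

    return [r.url for r in sorted(records, key=time_since_change)]


def discover_new(index_page_links: Iterable[str], known_urls: Iterable[str]) -> List[str]:
    """Return links found on an index page that the spider has not collected yet.

    Order of first appearance is preserved and duplicates are dropped.
    """
    seen = set(known_urls)
    fresh: List[str] = []
    for link in index_page_links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

Under these assumptions, a scheduler would crawl the head of `revisit_order` to catch updates and periodically re-fetch index pages, passing their links through `discover_new` to find pages that did not exist at the previous crawl.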