Schedule Web Forum Crawling with a Freshness-First Strategy

Jingtian Jiang, Nenghai Yu
DOI: https://doi.org/10.1109/iccsnt.2011.6182369
2011-01-01
Abstract: Web forums have become an important data resource for research because a large amount of user-generated content (UGC) is posted every day. Efficient web forum crawling is therefore a crucial problem. Previous work focuses on crawling all forum threads with minimal overhead; it treats all threads equally and adopts a breadth-first strategy. Some strategies, such as PageRank, take differences in link relations into account. However, none of them considers the difference between new threads and old threads, so they are not efficient enough for real-time applications, where freshness is a significant factor because users prefer fresh results to old ones. In this paper, we propose a freshness-first strategy for web forum crawling, which aims to fetch fresher content before less fresh content. The strategy is based on a characteristic of web forums: there is usually a last update time associated with each thread URL. By detecting the last update times of URLs in board pages, the proposed strategy schedules the crawling order of threads according to their freshness, i.e., their last update time. Experimental results demonstrate that the freshness-first strategy achieves the goal of crawling the freshest content first and significantly outperforms other strategies by 40% in different situations.
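The scheduling idea in the abstract can be illustrated with a small priority queue keyed on last update time. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which data structure the paper uses, and the class name `FreshnessFirstFrontier`, its `push`/`pop` methods, and the example URLs are all hypothetical. It assumes the (URL, last update time) pairs have already been extracted from a board page.

```python
import heapq
from datetime import datetime, timezone


class FreshnessFirstFrontier:
    """Hypothetical crawl frontier that pops the most recently updated thread first."""

    def __init__(self):
        self._heap = []    # entries: (-timestamp, insertion order, url)
        self._latest = {}  # url -> newest last update time seen so far
        self._counter = 0  # tie-breaker so equal timestamps pop in insertion order

    def push(self, url, last_update):
        # Ignore a thread whose update time is not newer than the one already known.
        known = self._latest.get(url)
        if known is not None and last_update <= known:
            return
        self._latest[url] = last_update
        # heapq is a min-heap, so negate the timestamp to pop the freshest first.
        heapq.heappush(self._heap, (-last_update.timestamp(), self._counter, url))
        self._counter += 1

    def pop(self):
        # Skip heap entries that were superseded by a later push for the same URL.
        while self._heap:
            neg_ts, _, url = heapq.heappop(self._heap)
            if -neg_ts == self._latest[url].timestamp():
                return url
        return None  # frontier is empty


# Example: pairs as they might be read from a board page (made-up values).
frontier = FreshnessFirstFrontier()
frontier.push("thread/1", datetime(2011, 1, 1, tzinfo=timezone.utc))
frontier.push("thread/2", datetime(2011, 1, 3, tzinfo=timezone.utc))
frontier.push("thread/3", datetime(2011, 1, 2, tzinfo=timezone.utc))
order = [frontier.pop() for _ in range(3)]  # thread/2, thread/3, thread/1
```

Because the frontier remembers the newest update time seen for each URL, a thread that was already crawled is queued again only when a board page later reports a newer update time for it.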