Estimating Collection Size in Distributed Search

Jingfang Xu, Sheng Wu, Xing Li
2007-01-01
Abstract: Distributed search is an effective way to search for information across the thousands of information collections available on the web. Collection size is an important feature in distributed search and plays a vital role in resource representation and selection. This paper proposes two novel algorithms for estimating collection size in uncooperative environments. The sample high frequent resample (SHFRS) algorithm first samples a collection with random queries and then resamples it with the highest-frequency queries in the sample sets. The heterogeneous capture (HC) algorithm accounts for capture probabilities that differ across documents and estimates collection size by conditional maximum likelihood. Both algorithms are evaluated on real web data. Experimental results show that they significantly outperform both the sample-resample and capture-recapture algorithms.
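For orientation, the capture-recapture baseline named in the abstract can be sketched as below. This is a minimal illustration of the classical two-sample (Lincoln-Petersen) estimator, not the SHFRS or HC algorithm proposed in the paper. The function names, the simulated collection, and the assumption that every document is equally likely to be sampled are all illustrative; the HC algorithm is motivated precisely by that last assumption failing in practice.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Estimate collection size from two samples of document IDs.

    Classical two-sample capture-recapture: N is approximately
    n1 * n2 / m, where m is the number of documents seen in both samples.
    Assumes every document has the same capture probability.
    """
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("samples share no documents; estimate is undefined")
    return len(first) * len(second) / overlap


def simulate(collection_size=10_000, sample_size=1_000, seed=0):
    """Hypothetical experiment: draw two uniform samples and estimate the size."""
    rng = random.Random(seed)
    docs = range(collection_size)
    first = rng.sample(docs, sample_size)
    second = rng.sample(docs, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f}")
```

In a real uncooperative setting the samples would come from documents returned by queries rather than uniform draws, so popular documents are captured more often and this simple estimator tends to be biased.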