Algorithm of Estimating Index Sizes of Resource Collections in Distributed Search

吴晟,李星
DOI: https://doi.org/10.3724/sp.j.1087.2008.02345
2009-01-01
Abstract: Distributed search is an effective way to search the Deep Web, and collection size is an important feature for collection representation and selection in distributed search. To estimate collection size in uncooperative environments, this paper proposes two novel algorithms. The high-frequency resample algorithm first samples a collection with random queries, then resamples it with queries that occur with high frequency in the sample set. The heterogeneous capture algorithm, which assumes that capture probabilities differ among documents, uses logistic functions and conditional maximum likelihood estimation. Experimental results show that the proposed algorithms outperform both the sample-resample and capture-recapture algorithms.
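
The resampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard sample-resample scaling (the collection's reported document frequency for a term, multiplied by the sample size and divided by the term's document frequency in the sample) and simply restricts the resample queries to the most frequent sample terms. The names `estimate_collection_size`, `collection_df`, and `top_k` are hypothetical, and `collection_df` stands in for whatever hit count an uncooperative search interface reports.

```python
from collections import Counter


def estimate_collection_size(sample_docs, collection_df, top_k=3):
    """Estimate collection size from a document sample (illustrative only).

    sample_docs   -- list of sampled documents, each an iterable of terms
    collection_df -- hypothetical callable: term -> number of documents the
                     collection reports as matching that term
    top_k         -- how many high-frequency sample terms to resample with
    """
    if not sample_docs:
        raise ValueError("sample_docs must not be empty")

    # Document frequency of each term inside the sample.
    sample_df = Counter()
    for doc in sample_docs:
        sample_df.update(set(doc))

    sample_size = len(sample_docs)
    estimates = []
    # Resample using the terms that appear in the most sampled documents.
    for term, df_in_sample in sample_df.most_common(top_k):
        # Assumed scaling: the term's share of the sample is taken to match
        # its share of the whole collection.
        estimates.append(collection_df(term) * sample_size / df_in_sample)

    # Average the per-term estimates.
    return sum(estimates) / len(estimates)
```

Averaging over several frequent terms is one simple way to combine the per-term estimates; the paper may combine them differently.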