Split Bloom Filter

肖明忠,代亚非,李晓明
DOI: https://doi.org/10.3321/j.issn:0372-2112.2004.02.015
2004-01-01
Tien Tzu Hsueh Pao/Acta Electronica Sinica
Abstract: A Bloom Filter is a simple, space-efficient randomized data structure for representing a set in order to support membership queries. It uses an m-bit array to represent a data set and answers queries by hashing. The compact representation is the payoff for allowing a small rate of false positives in membership queries; that is, a query might wrongly report an element as a member of the set. However, for many applications, especially systems with large-scale data sets, the savings in space and lookup time often outweigh this drawback, provided the probability of an error is sufficiently low and can be tolerated by the application. The paper first surveys the Bloom Filter and its variants in detail and gives the mathematical analysis of their space/time/error-rate tradeoffs in order to explain their practicability. We then present a new kind of Bloom Filter, the Split Bloom Filter, which uses an s × m-bit matrix to represent a set, and analyze it in the same detail as the earlier variants. In distributed systems, each network node owns a data set, and it is reasonable to expect that a few nodes hold large amounts of data while most nodes hold only a little. If all nodes use the same algorithm parameters, a great deal of memory space is wasted. In addition, if the number of elements in a node's data set increases continually, the error rate grows until the representation becomes meaningless. We prove that the Split Bloom Filter can efficiently solve or mitigate these two problems.
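The m-bit array and the s × m-bit matrix described in the abstract can be illustrated with a minimal Python sketch. The `BloomFilter` class follows the standard textbook construction (k hash positions per element, false positive rate approximated by (1 - e^(-kn/m))^k). The `RowedBloomFilter` class is only one possible reading of an s × m-bit matrix, assumed here for illustration: each row is an m-bit filter, and a new row is opened once the current row holds `row_capacity` elements. The abstract does not specify the paper's actual construction, and all class names, parameters, and the SHA-256 double-hashing scheme are hypothetical choices.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter: an m-bit array probed at k hashed positions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)
        self.count = 0  # number of add() calls; used only for the estimate

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k indexes by double hashing two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

    def estimated_false_positive_rate(self) -> float:
        # Textbook approximation (1 - e^(-kn/m))^k with n = self.count.
        return (1.0 - math.exp(-self.k * self.count / self.m)) ** self.k


class RowedBloomFilter:
    """Hypothetical s x m layout: up to s rows, each an m-bit BloomFilter.

    This is an illustrative assumption, not the paper's definition.
    """

    def __init__(self, s: int, m: int, k: int, row_capacity: int) -> None:
        self.s = s
        self.m = m
        self.k = k
        self.row_capacity = row_capacity
        self.rows = [BloomFilter(m, k)]  # rows are allocated on demand

    def add(self, item: str) -> None:
        if self.rows[-1].count >= self.row_capacity:
            if len(self.rows) == self.s:
                raise OverflowError("all s rows are at capacity")
            self.rows.append(BloomFilter(self.m, self.k))
        self.rows[-1].add(item)

    def __contains__(self, item: str) -> bool:
        # An element is reported present if any row reports it.
        return any(item in row for row in self.rows)
```

Under this reading, a node with little data allocates a single row, while a node whose set keeps growing adds rows so that no individual row exceeds the load it was sized for. The price is that a query may probe every allocated row.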