Better with Fewer Bits: Improving the Performance of Cardinality Estimation of Large Data Streams

Qingjun Xiao,You Zhou,Shigang Chen
DOI: https://doi.org/10.1109/infocom.2017.8057088
2017-01-01
Abstract:Cardinality estimation is the task of determining the number of distinct elements (or the cardinality) in a data stream, under a stringent constraint that the input data stream can be scanned by just a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring of high-speed networks and query optimization of Internetscale database. To solve the problem, we propose an algorithm named HLL-TailCut+, which implements the estimation standard error 1.0/√m using the memory units of three bits each, whose cost is much smaller than the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator. This makes it possible to reduce the memory cost of HyperLogLog by 45%. For example, when the target estimation error is 1.1%, state-of-the-art HyperLogLog needs 5.6 kilobytes memory. By contrast, our new algorithm only needs 3 kilobytes memory consumption for attaining the same accuracy. Additionally, our algorithm is able to support the estimation of very large stream cardinalities, even on the Tera and Peta scale.
What problem does this paper attempt to address?