DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams
Jiaqian Liu, Ran Ben Basat, Louis De Wardt, Haipeng Dai, Guihai Chen
DOI: https://doi.org/10.1109/icde60146.2024.00365
2024-01-01
Abstract: Sketches have gained popularity as effective methods for estimating frequency in data streams, and optimizing their accuracy is critical in many applications. However, while sketches are backed by a standard guarantee under worst-case analysis, their actual errors can vary significantly on real-world skewed data streams. It is therefore challenging to configure sketches to optimize accuracy without prior knowledge of the input. Moreover, even with a new configuration, it is unclear when to apply it. This paper presents a novel sketch framework that can be dynamically configured to optimize accuracy given a processed data stream. Specifically, we provide a precise guarantee and derive an optimal number of hash functions under the Zipfian distribution, which is an appropriate way to model skewed data streams in practice. We then propose a dynamically configurable sketch framework, namely DISCO, that can estimate the distribution parameter and adjust the number of hash functions on the fly to optimize accuracy. We provide rigorous mathematical analysis and apply DISCO to three classical solutions: the Count-min, Conservative Update, and Count sketches. Experimental results, using synthetic and real datasets, show that DISCO can achieve the optimal configuration for the metric (i.e., FP) related to the sketch guarantee, while achieving near-optimal accuracy for other common metrics (e.g., ARE) compared with state-of-the-art methods.
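To make the configuration question concrete, the following is a minimal Python sketch of a plain Count-min sketch whose number of hash functions is a parameter. It is not the DISCO algorithm and does not reproduce the paper's analysis. It assumes a fixed counter budget that is split evenly across the rows, so that using more hash functions makes each row narrower. The names `CountMinSketch`, `zipf_stream`, and `average_absolute_error`, the hash family, and the integer-key restriction are all assumptions made for illustration.

```python
import random
from collections import Counter


class CountMinSketch:
    """Plain Count-min sketch; a fixed counter budget is split across rows.

    Illustrative only: the budget-splitting rule is an assumption, and keys
    are assumed to be non-negative integers.
    """

    PRIME = (1 << 61) - 1  # Mersenne prime used by the hash family below

    def __init__(self, total_counters, num_hashes, seed=0):
        if num_hashes < 1 or total_counters < num_hashes:
            raise ValueError("need 1 <= num_hashes <= total_counters")
        self.d = num_hashes
        self.w = total_counters // num_hashes  # more rows means narrower rows
        rng = random.Random(seed)
        # One (a, b) pair per row for the hash h(x) = ((a*x + b) mod p) mod w.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.d)
        ]
        self.rows = [[0] * self.w for _ in range(self.d)]

    def _index(self, row, key):
        a, b = self.params[row]
        return ((a * key + b) % self.PRIME) % self.w

    def update(self, key, count=1):
        # Increment one counter in every row.
        for r in range(self.d):
            self.rows[r][self._index(r, key)] += count

    def query(self, key):
        # The minimum over rows never underestimates the true count.
        return min(self.rows[r][self._index(r, key)] for r in range(self.d))


def zipf_stream(n_items, length, skew, seed=0):
    """Draw `length` keys from {0..n_items-1} with weight 1 / rank**skew."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=length)


def average_absolute_error(stream, total_counters, num_hashes):
    """Mean overestimate across the distinct keys seen in the stream."""
    sketch = CountMinSketch(total_counters, num_hashes)
    for key in stream:
        sketch.update(key)
    truth = Counter(stream)
    return sum(sketch.query(k) - c for k, c in truth.items()) / len(truth)
```

Calling `average_absolute_error(stream, 256, d)` for several values of `d`, on streams generated with different `skew` values, gives a rough empirical view of how the preferred number of hash functions can depend on the input distribution.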