HistSketch: A Compact Data Structure for Accurate Per-Key Distribution Monitoring.

Jintao He, Jiaqi Zhu, Qun Huang
DOI: https://doi.org/10.1109/icde55515.2023.00156
2023-01-01
Abstract: Stream processing is critical to data analytics. However, one important class of characteristics, namely per-key distribution (i.e., the item distribution of every key), remains unsolved. Traditional stream processing methods such as sampling and histograms do not focus on per-key distribution. Although sketches are widely applied to huge, high-speed data streams, they mainly compute single-valued characteristics (one value per key). Per-key distribution, by contrast, must track multiple values for each key, which amplifies the resources needed. To this end, we present HistSketch, a novel sketch-based algorithm for per-key distribution. Its key idea is to differentiate hot keys from infrequent keys and to handle them with different components. For hot keys, HistSketch allocates dedicated counters. For infrequent keys, HistSketch allows counter sharing to reduce memory usage. In addition, we propose two optimization mechanisms for HistSketch: histogram shedding further reduces the storage overhead, while equation-based decoding compensates for the error caused by counter sharing. Our evaluation compares HistSketch with nine state-of-the-art sketch-based solutions using five datasets. Our results show that HistSketch achieves both high accuracy and low resource usage.
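The hot/infrequent split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `TwoTierHistogram`, its parameters, the CRC32-based hashing, and the promotion rule (a key becomes hot once one of its shared counters reaches a threshold) are all assumptions made for illustration, and neither histogram shedding nor equation-based decoding is modeled.

```python
import zlib


class TwoTierHistogram:
    """Toy per-key histogram (hypothetical, not the HistSketch algorithm itself):
    hot keys get dedicated bucket counters, other keys share one counter array."""

    def __init__(self, num_buckets, hot_capacity, shared_size, hot_threshold):
        self.num_buckets = num_buckets
        self.hot_capacity = hot_capacity      # max number of keys with dedicated counters
        self.hot_threshold = hot_threshold    # assumed promotion threshold
        self.hot = {}                         # key -> list of dedicated bucket counters
        self.shared = [0] * shared_size       # counters shared by all infrequent keys

    def _slot(self, key, bucket):
        # Deterministic hash of (key, bucket) into the shared array.
        return zlib.crc32(f"{key}|{bucket}".encode()) % len(self.shared)

    def update(self, key, bucket):
        """Record one item of `key` that falls into histogram bucket `bucket`."""
        if key in self.hot:
            self.hot[key][bucket] += 1
            return
        slot = self._slot(key, bucket)
        self.shared[slot] += 1
        # Assumed policy: promote the key once this shared counter is large
        # enough and a dedicated slot is still free. Counts already in the
        # shared array stay there; dedicated counters start from zero.
        if self.shared[slot] >= self.hot_threshold and len(self.hot) < self.hot_capacity:
            self.hot[key] = [0] * self.num_buckets

    def query(self, key):
        """Estimate the per-bucket counts of `key` (never an underestimate)."""
        cold = [self.shared[self._slot(key, b)] for b in range(self.num_buckets)]
        dedicated = self.hot.get(key, [0] * self.num_buckets)
        return [d + c for d, c in zip(dedicated, cold)]


sketch = TwoTierHistogram(num_buckets=4, hot_capacity=1, shared_size=1024, hot_threshold=3)
for _ in range(10):
    sketch.update("a", 1)   # "a" becomes hot after its third update
for _ in range(2):
    sketch.update("b", 2)   # "b" stays in the shared counters
print(sketch.query("a"), sketch.query("b"))
```

Because shared counters can collide, estimates for infrequent keys may exceed the true counts; this is the kind of error that the paper's equation-based decoding is said to compensate for.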