QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams

Yiyan Qi, Rundong Li, Pinghui Wang, Yufang Sun, Rui Xing

2024-06-27

Abstract: Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element carries a positive weight. Unlike traditional cardinality estimation, limited research exists on weighted cardinality, with current methods requiring substantial memory and computational resources, challenging for devices with limited capabilities and real-time applications like anomaly detection. To address these issues, we propose QSketch, a memory-efficient sketch method for estimating weighted cardinality in streams. QSketch uses a quantization technique to condense continuous variables into a compact set of integer variables, with each variable requiring only 8 bits, making it 8 times smaller than previous methods. Furthermore, we leverage dynamic properties during QSketch generation to significantly enhance estimation accuracy and achieve a lower time complexity of $O(1)$ for updating estimations upon encountering a new element. Experimental results on synthetic and real-world datasets show that QSketch is approximately 30% more accurate and two orders of magnitude faster than the state-of-the-art, using only $1/8$ of the memory.

Databases, Data Structures and Algorithms

What problem does this paper attempt to address?

This paper addresses the problem of estimating weighted cardinality, i.e., the total weight of the distinct elements in a data stream. Existing methods need substantial memory and computation, which makes them hard to use on devices with limited capabilities and in real-time applications such as anomaly detection. To address this, the paper proposes QSketch, a memory-efficient sketch for estimating weighted cardinality in streams. QSketch uses quantization to compress continuous variables into a compact set of integer variables of 8 bits each, which makes the sketch 8 times smaller than previous methods. It also exploits dynamic properties during sketch generation to improve estimation accuracy and to update the estimate in O(1) time when a new element arrives. Experiments on synthetic and real-world datasets show that QSketch is approximately 30% more accurate and two orders of magnitude faster than state-of-the-art methods, while using only 1/8 of the memory. The paper also proposes an extended version, QSketch-Dyn, which tracks weighted cardinality in real time and further improves efficiency. Together, QSketch and QSketch-Dyn offer a more efficient and memory-friendly solution that is particularly useful for high-speed data streams and devices with limited resources.
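To make the quantities in this summary concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each element always arrives with the same positive weight, and it assumes a generic construction: one exponential variable per element and register, quantized to an integer that fits in 8 bits, with the estimate obtained by maximum likelihood. The names `unit_hash` and `QuantizedSketch`, the BLAKE2b hash, the register range [-127, 127], and the bisection solver are all illustrative assumptions. This version also touches every register on each update, so it does not reproduce the O(1) update described above.

```python
import hashlib
import math


def exact_weighted_cardinality(stream):
    """Sum the weight of each distinct element once (memory grows with the stream)."""
    weights = {}
    for element, weight in stream:
        weights[element] = weight
    return sum(weights.values())


def unit_hash(element, j):
    """Hypothetical helper: repeatable value strictly inside (0, 1) per (element, register)."""
    digest = hashlib.blake2b(f"{j}|{element}".encode(), digest_size=8).digest()
    return ((int.from_bytes(digest, "big") >> 12) + 0.5) / 2.0**52


class QuantizedSketch:
    """Illustrative sketch with m integer registers, each within an 8-bit range."""

    def __init__(self, m=256, lo=-127, hi=127):
        self.m, self.lo, self.hi = m, lo, hi
        self.registers = [lo] * m

    def update(self, element, weight):
        for j in range(self.m):
            # Exponential variable with rate `weight`; the minimum over distinct
            # elements is exponential with rate equal to the weighted cardinality.
            r = -math.log(unit_hash(element, j)) / weight
            # Quantize: keeping the max of floor(-log2 r) tracks the min of r.
            y = max(self.lo, min(self.hi, math.floor(-math.log2(r))))
            if y > self.registers[j]:
                self.registers[j] = y

    def estimate(self):
        if all(k == self.lo for k in self.registers):
            return 0.0  # nothing inserted yet
        # Register value k means the minimum fell in (a, 2a] with a = 2^-(k+1).
        a = [2.0 ** -(k + 1) for k in self.registers]

        def score(c):
            # Derivative of the log-likelihood in c; it decreases monotonically.
            total = 0.0
            for aj in a:
                x = min(c * aj, 700.0)  # cap to keep expm1 from overflowing
                total += aj * (1.0 / math.expm1(x) - 1.0)
            return total

        # The root is bracketed by ln(2)/max(a) and ln(2)/min(a); bisect on it.
        low, high = math.log(2) / max(a), math.log(2) / min(a)
        for _ in range(200):
            mid = 0.5 * (low + high)
            if score(mid) > 0:
                low = mid
            else:
                high = mid
        return 0.5 * (low + high)


# Usage: 300 distinct elements, each seen twice; duplicates must not count twice.
stream = [(f"item-{i}", 0.5 + (i % 7)) for i in range(300)]
stream = stream + stream
truth = exact_weighted_cardinality(stream)
sketch = QuantizedSketch(m=256)
for element, weight in stream:
    sketch.update(element, weight)
print(f"exact={truth:.1f} estimate={sketch.estimate():.1f}")
```

Because the hash depends only on the element and the register index, a repeated element regenerates the same values and leaves the registers unchanged, which is what lets the sketch count each distinct element's weight once. The exact baseline stores every distinct element, whereas the sketch keeps only `m` small integers.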

