Abstract: Estimating the Jaccard similarity between two or more streaming sets is a fundamental problem with many applications, such as mining the co-occurrences of host communication behaviors in IP networks and the co-purchasing behaviors of online users in e-commerce systems. The elements of a streaming data set arrive sequentially at the stream processing engine, whose working memory is limited in size. To meet this resource constraint, each data set must be stored in a compressed format called a “sketch”, so that millions or even billions of such sets can be held in the engine's memory. To balance memory cost against similarity estimation accuracy, many sketching algorithms have been proposed, such as MinHash, HyperLogLog, and theta sketch. As far as we know, most previous solutions fail to handle fully dynamic streaming sets, which allow both the insertion and the deletion of elements. Although PCSA$\pm$ and virtual odd sketch (VOS) partially solve the deletion problem, their memory efficiency and estimation accuracy can be fundamentally improved. In this paper, we propose the multi-resolution odd sketch (MROS), which allows more accurate similarity estimation with less memory consumption. MROS encodes a streaming set into multiple layers of odd sketches with exponentially decreasing sampling probabilities. Whether a set is small or large, we can pick a suitable sampling probability to estimate its cardinality accurately. Next, we present an algorithm to estimate the extended Jaccard similarity of multiple streaming sets. Their compressed MROS summaries are merged by bitwise XOR, which in turn helps estimate the cardinality of the symmetric difference of the sets. Then, using these estimated cardinalities as observations, we estimate the size of an arbitrary set expression built from union $\bigcup$, intersection $\bigcap$, relative complement $\setminus$, and symmetric difference $\Delta$. Our evaluation results show that MROS is more accurate than MinHash, KMV, PCSA$\pm$, and VOS, and that it supports element deletion.
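The ideas in the abstract (layers of odd sketches, XOR merging, and estimation through the symmetric difference) can be illustrated with a small Python sketch. This is not the paper's implementation: the class name `MultiResOddSketch`, the method names, the BLAKE2b-based hashing, the nested sampling rule (layer i keeps an element with probability 2^-i), and the 0.3 fill threshold for picking a layer are all assumptions made for illustration. The per-layer estimator is the standard odd sketch formula, which maps z set bits out of n to roughly -(n/2) ln(1 - 2z/n) elements.

```python
import hashlib
import math


def _hash64(x, salt):
    """Deterministic 64-bit hash of x (illustrative choice, not from the paper)."""
    data = f"{salt}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class MultiResOddSketch:
    """Hypothetical multi-layer odd sketch; layer i samples with probability 2^-i."""

    def __init__(self, layers=8, bits=4096):
        self.layers = layers
        self.bits = bits
        self.rows = [0] * layers  # each layer is a Python int used as a bit array

    def toggle(self, x):
        """Insert or delete x. Both operations flip the same bits (XOR)."""
        h = _hash64(x, "level")
        pos = _hash64(x, "pos") % self.bits
        for i in range(self.layers):
            # Assumed nested sampling: x is kept in layer i only if the
            # lowest i bits of h are all zero, which happens with prob. 2^-i.
            if h & ((1 << i) - 1):
                break
            self.rows[i] ^= 1 << pos

    def xor(self, other):
        """Merge two sketches; the result summarizes the symmetric difference."""
        out = MultiResOddSketch(self.layers, self.bits)
        out.rows = [a ^ b for a, b in zip(self.rows, other.rows)]
        return out

    def estimate(self):
        """Estimate cardinality from the first layer that is not saturated."""
        n = self.bits
        for i, row in enumerate(self.rows):
            z = bin(row).count("1")
            # 0.3 is an assumed threshold; an odd sketch saturates near z/n = 0.5.
            if z < 0.3 * n or i == self.layers - 1:
                z = min(z, n // 2 - 1)  # keep the logarithm's argument positive
                return (2 ** i) * (-n / 2) * math.log(1 - 2 * z / n)


def jaccard(sa, sb):
    """Jaccard estimate from |A|, |B| and |A symmetric-difference B|."""
    a, b, d = sa.estimate(), sb.estimate(), sa.xor(sb).estimate()
    inter = (a + b - d) / 2
    union = (a + b + d) / 2
    return inter / union if union > 0 else 1.0


if __name__ == "__main__":
    A, B = MultiResOddSketch(), MultiResOddSketch()
    for x in range(0, 3000):
        A.toggle(x)
    for x in range(1000, 4000):
        B.toggle(x)
    B.toggle("temp")  # insert an extra element ...
    B.toggle("temp")  # ... and delete it again
    print("estimated |A|:", round(A.estimate()))
    print("estimated Jaccard:", round(jaccard(A, B), 3))
```

Because an insertion and a deletion flip the same bits, deleting an element restores the sketch exactly, and XOR-ing two sketches cancels the shared elements so that only the symmetric difference remains. The intersection and union sizes then follow from |A ∩ B| = (|A| + |B| - |A Δ B|) / 2 and |A ∪ B| = (|A| + |B| + |A Δ B|) / 2.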

Sampling Space-Saving Set Sketches

MTS Sketch for Accurate Estimation of Set-Expression Cardinalities from Small Samples

Multi-resolution Odd Sketch for Mining Extended Jaccard Similarity of Dynamic Streaming Sets

An Accurate Estimation Algorithm for Big Data Streams.

SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation

Multi-resolution Odd Sketch for Mining Jaccard Similarities Between Dynamic Streaming Sets

OneSketch: A Generic and Accurate Sketch for Data Streams

QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams

Diamond Sketch: Accurate Per-flow Measurement for Big Streaming Data

Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting

Discussion On Fast And Accurate Sketches For Skewed Data Streams: A Case Study

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

SF-Sketch: A Two-Stage Sketch for Data Streams

SimiSketch: Efficiently Estimating Similarity of streaming Multisets

ABC: A practicable sketch framework for non-uniform multisets

Consistent Weighted Sampling Made Fast, Small, and Easy

Generalized Sketches for Streaming Sets

HistSketch: A Compact Data Structure for Accurate Per-Key Distribution Monitoring.

SF-sketch: A Fast, Accurate, and Memory Efficient Data Structure to Store Frequencies of Data Items

Bubble Sketch: A High-performance and Memory-efficient Sketch for Finding Top-K Items in Data Streams

Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching