Abstract: To fill absent values in massive-domain data streams, an algorithm based on the Count-Min Sketch and a new Frequency-Min Sketch is proposed. In some data stream applications, such as network traffic monitoring, the domains of IP addresses and certain other attributes are massive. A data stream management system usually prefers to store a sketch rather than the whole dataset. Consequently, traditional imputation methods are not suitable for filling the absent values of a massive-domain data stream, which arise during data collection and transmission, and simply ignoring those values is not suitable either. Count-Min Sketch is a well-known lightweight data stream sketch, and this paper proposes a Frequency-Min Sketch based on it. Using the two sketches together, a method for filling the absent values in a massive-domain data stream is designed. A hashing approach keeps track of the attribute-value statistics. We use pairwise independent hash functions H = {h(x) = ((ax + b) mod p) mod w, x ∈ U, a, b ∈ Z_p}, each of which maps onto uniformly random integers in the range [0, w-1]. The data structure itself is a 2-dimensional array with a length of d and a width of w. Each hash function corresponds to one of the d 1-dimensional arrays, each with w cells. In network traffic monitoring applications, the hash functions are used to update the traffic sums in the cells of this 2-dimensional structure, and the packet counts are updated in the corresponding cells at the same time. The quotient of the minimum traffic sum (the Count-Min Sketch value) and the minimum packet count (which we call the Frequency-Min Sketch value, and which is not necessarily in the same cell as the Count-Min Sketch value) is then used to fill the absent traffic value of a data packet. Theoretical analysis and experimental results show that the error of filling absent values with Count-Min Sketch and Frequency-Min Sketch is lower than that with Histogram Sketch. The error does not increase as the attribute domains grow, and it tends toward a stable value as the data volume increases. The method also has lower time and space complexity: given an error parameter ε, the time and space bounds are 1/ε (in some cases the time bound is 1).
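
The update and fill steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class name `TrafficSketch`, its method names, the choice of prime p = 2^61 - 1, the restriction of a to nonzero values, and the use of integer keys (for example, an IP address converted to an integer) are all assumptions made for the example.

```python
import random


class TrafficSketch:
    """Two parallel d x w counter arrays that share the same hash functions.

    `traffic` holds per-cell traffic sums (Count-Min style) and `packets`
    holds per-cell packet counts (the abstract's Frequency-Min Sketch).
    All names here are hypothetical, chosen for illustration only.
    """

    PRIME = (1 << 61) - 1  # assumed p: a Mersenne prime larger than any key

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        # One (a, b) pair per row, for h(x) = ((a*x + b) mod p) mod w.
        # Assumption: a is drawn nonzero, as is usual for this hash family.
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(d)
        ]
        self.traffic = [[0] * w for _ in range(d)]
        self.packets = [[0] * w for _ in range(d)]

    def _cells(self, key):
        """Yield the (row, column) cell that each hash function selects."""
        for row, (a, b) in enumerate(self.coeffs):
            yield row, ((a * key + b) % self.PRIME) % self.w

    def update(self, key, size):
        """Record one packet of `size` bytes for integer `key`."""
        for row, col in self._cells(key):
            self.traffic[row][col] += size
            self.packets[row][col] += 1

    def impute(self, key):
        """Estimate a missing packet size as min traffic sum / min packet count.

        The two minima are taken independently, so they may come from
        different rows. Returns None if the key has never been counted.
        """
        cells = list(self._cells(key))
        min_traffic = min(self.traffic[r][c] for r, c in cells)
        min_packets = min(self.packets[r][c] for r, c in cells)
        if min_packets == 0:
            return None
        return min_traffic / min_packets


if __name__ == "__main__":
    sketch = TrafficSketch(d=4, w=64)
    sketch.update(key=3232235777, size=100)  # 192.168.1.1 as an integer
    sketch.update(key=3232235777, size=300)
    print(sketch.impute(3232235777))
```

With a single key in the sketch there are no collisions, so the estimate equals that key's exact mean packet size. Once many keys share cells, both minima become overestimates, and the quotient is only an approximation of the mean.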

Fill absent values in massive domain data stream
