Fill absent values in massive domain data stream

Zhao Fei, Liu Qi-Zhi, Zhang Yan, Bai Wen-Yang
DOI: https://doi.org/10.13232/j.cnki.jnju.2011.01.013
2011-01-01
Abstract: To fill absent values in massive-domain data streams, an algorithm based on the Count-Min Sketch and a Frequency-Min Sketch is proposed. In some data stream applications, such as network traffic monitoring, the domains of IP addresses and some other attributes are massive. A Data Stream Management System usually prefers to store a sketch rather than the whole dataset. Absent values are generated during data collection and transmission, and in this setting it is suitable neither to fill them with traditional imputation methods nor to ignore them. The Count-Min Sketch is a well-known lightweight data stream sketch, and this paper proposes a Frequency-Min Sketch based on it. With the Count-Min Sketch and the Frequency-Min Sketch, a method for filling absent values in a massive-domain data stream is designed. A hashing approach is used to keep track of the attribute-value statistics. We use pairwise independent hash functions H = {h(x) = ((ax + b) mod p) mod w, x ∈ U, a, b ∈ Z_p}, each of which maps its input onto uniformly random integers in the range [0, w-1]. The data structure itself consists of a 2-dimensional array with a length of d and a width of w. Each hash function corresponds to one of the d 1-dimensional arrays, with w cells each. In network traffic monitoring applications, the hash functions are used to update the traffic sums in the corresponding cells of the 2-dimensional data structure. At the same time, the numbers of data packets are updated in the corresponding cells. The quotient of the minimum traffic sum (i.e. the Count-Min Sketch) and the minimum number of data packets (which we call the Frequency-Min Sketch, and which is not necessarily found in the same cell as the Count-Min Sketch value) is then used to fill the absent traffic value of a data packet. Theoretical analysis and experimental results show that the error of filling absent values based on the Count-Min Sketch and Frequency-Min Sketch is smaller than that based on a Histogram Sketch. The error does not increase as the attribute domains grow, and it tends towards a stable value as the data quantity increases. The method also has low time and space complexity. Given an error parameter ε, the time and space bound is 1/ε (in some cases the time bound is 1).
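The filling procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `TrafficSketch`, the methods `update` and `impute`, the default sizes `d=4` and `w=256`, the prime `p = 2^61 - 1`, the seeded random choice of `a` and `b` (with `a` kept nonzero), and the use of integer keys (for example an IP address converted to an integer) are all assumptions made for illustration.

```python
import random


class TrafficSketch:
    """Hypothetical sketch: two parallel d x w arrays holding, per cell,
    the traffic sum and the number of packets hashed into that cell."""

    def __init__(self, d=4, w=256, p=(1 << 61) - 1, seed=0):
        rng = random.Random(seed)
        self.d, self.w, self.p = d, w, p
        # One (a, b) pair per row, giving h(x) = ((a*x + b) mod p) mod w.
        # Assumption: a is drawn nonzero, b from the full range [0, p-1].
        self.coeffs = [(rng.randrange(1, p), rng.randrange(0, p))
                       for _ in range(d)]
        self.sums = [[0.0] * w for _ in range(d)]    # traffic sums
        self.counts = [[0] * w for _ in range(d)]    # packet counts

    def _cells(self, key):
        """Yield the (row, column) cell that each hash function selects."""
        for row, (a, b) in enumerate(self.coeffs):
            yield row, ((a * key + b) % self.p) % self.w

    def update(self, key, traffic):
        """Record one packet with a known traffic value for this key."""
        for row, col in self._cells(key):
            self.sums[row][col] += traffic
            self.counts[row][col] += 1

    def impute(self, key):
        """Estimate an absent traffic value as (min sum) / (min count).
        The two minima are taken independently, so they may come from
        different rows. Returns None if the key was never observed."""
        cells = list(self._cells(key))
        min_sum = min(self.sums[r][c] for r, c in cells)
        min_count = min(self.counts[r][c] for r, c in cells)
        return min_sum / min_count if min_count else None


sketch = TrafficSketch()
sketch.update(42, 100.0)   # key 42 stands in for an integer-encoded address
sketch.update(42, 300.0)
print(sketch.impute(42))   # 200.0
```

With only one key in the sketch there are no collisions, so the estimate equals the exact mean of the observed values. On a real stream, collisions inflate both the sums and the counts, which is why the minimum over the d rows is taken for each.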