Continuously Distinct Sampling over Centralized and Distributed High Speed Data Streams

Pinghui Wang, Xiangyu Wang, Jing Tao, Peng Zhang, Xiaohong Guan
DOI: https://doi.org/10.1109/tpds.2018.2865452
IF: 5.3
2019-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Distinct sampling is fundamental for computing statistics (e.g., the age and gender distribution of distinct users accessing a particular website) that depend on the set of distinct keys (e.g., user IDs) in a large, high-speed data stream such as a sequence of key-update pairs. However, the major shortcoming of existing methods is their high computational cost, which is incurred by determining whether each incoming key in the data stream is currently in the set of sampled keys and by keeping track of the sampled keys' update aggregations. To address this challenge, we develop a new method, random projection and eviction (RPE), that uses a list of buckets to continuously sample distinct keys and their update aggregations. RPE processes each key-update pair with small and nearly constant time complexity $O(1)$. Besides centralized data streams, we also develop a novel method, DRPE, to handle distributed data streams consisting of key-update pairs observed at multiple distributed sites. We conduct extensive experiments on real-world datasets, and the results demonstrate that RPE and DRPE reduce the memory, computational, and message costs of state-of-the-art methods by several times.
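To make the problem setting concrete, the following is a minimal sketch of a generic hash-based (bottom-k) distinct sampler over key-update pairs. It is not RPE or DRPE, whose bucket-based design is not described in the abstract; it only illustrates the two per-pair tasks the abstract identifies as costly: checking whether an incoming key is already sampled, and maintaining each sampled key's update aggregation. The names `key_hash` and `BottomKDistinctSampler`, the choice of BLAKE2b as the hash, and summation as the aggregation are all assumptions made for illustration.

```python
import hashlib
import heapq


def key_hash(key: str) -> int:
    """Map a key to a pseudo-random 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class BottomKDistinctSampler:
    """Keeps the k distinct keys with the smallest hash values and sums their updates.

    Illustrative baseline only; this is NOT the paper's RPE method.
    """

    def __init__(self, k: int):
        self.k = k
        self.totals = {}  # sampled key -> aggregated (summed) updates
        self.heap = []    # max-heap on hash value, stored as (-hash, key)

    def process(self, key: str, update: int) -> None:
        # Membership check: a key that is already sampled only needs aggregation.
        if key in self.totals:
            self.totals[key] += update
            return
        h = key_hash(key)
        if len(self.totals) < self.k:
            # Sample is not full yet: admit the new key.
            heapq.heappush(self.heap, (-h, key))
            self.totals[key] = update
        elif h < -self.heap[0][0]:
            # New key hashes below the current largest sampled hash:
            # evict that key and admit the new one.
            _, evicted = heapq.heapreplace(self.heap, (-h, key))
            del self.totals[evicted]
            self.totals[key] = update


# Usage: 50 distinct users, each appearing 10 times with update value 1.
stream = [(f"user{i % 50}", 1) for i in range(500)]
sampler = BottomKDistinctSampler(k=5)
for key, update in stream:
    sampler.process(key, update)
```

Because the admission threshold only decreases, a key that was rejected or evicted can never re-enter the sample, so the stored totals are exact for every sampled key. In this sketch each admission costs a heap operation of O(log k), which contrasts with the nearly constant per-pair cost the abstract claims for RPE.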