Finding needles in a hay stream: On persistent item lookup in data streams
Lin Chen, Haipeng Dai, Lei Meng, Jihong Yu
DOI: https://doi.org/10.1016/j.comnet.2020.107518
IF: 5.493
2020-11-01
Computer Networks
Abstract:<p>In a data stream composed of an ordered sequence of data items, <em>persistent items</em> are those that keep occurring over a long timespan. Compared with ordinary items, persistent ones, though not necessarily more frequent, typically convey more valuable information. <em>Persistent item lookup</em>, the task of identifying all persistent items, is a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm that supports high-speed, high-accuracy lookup with limited memory cost. Our design rests on two key techniques. First, our algorithm attempts to record only persistent items seen so far, based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life, highly skewed data streams. Second, our algorithm balances the recording load in both the time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply a state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By integrating these components, we obtain a persistent item lookup algorithm that outperforms existing solutions in a wide range of practical settings.</p>
computer science, information systems,telecommunications,engineering, electrical & electronic, hardware & architecture
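
The distinction the abstract draws between persistent and merely frequent items can be illustrated with a small exact baseline. This is a sketch under stated assumptions, not the algorithm proposed in the paper: it keeps one counter per distinct item (so its memory is unbounded), it assumes an epoch is a fixed number of consecutive items, and it assumes an item counts as persistent when it occurs in at least `threshold` distinct epochs. The function name `persistent_items` and the parameters `epoch_len` and `threshold` are hypothetical.

```python
from collections import defaultdict


def persistent_items(stream, epoch_len, threshold):
    """Return the items that occur in at least `threshold` distinct epochs.

    Assumption: an epoch is `epoch_len` consecutive items of the stream.
    This is an exact, memory-hungry baseline for illustration only.
    """
    epochs_seen = defaultdict(int)  # item -> number of epochs it appeared in
    seen_this_epoch = set()         # items already counted in the current epoch

    for position, item in enumerate(stream):
        if position % epoch_len == 0:
            # A new epoch starts: forget which items were counted in the last one.
            seen_this_epoch.clear()
        if item not in seen_this_epoch:
            # Count each item at most once per epoch, however often it repeats.
            seen_this_epoch.add(item)
            epochs_seen[item] += 1

    return {item for item, count in epochs_seen.items() if count >= threshold}


# "a" and "d" both occur three times, but only "a" is spread over three epochs.
stream = ["a", "b", "b", "b", "a", "c", "b", "b", "a", "d", "d", "d"]
result = persistent_items(stream, epoch_len=3, threshold=3)
```

In this toy stream, `d` is as frequent as `a` yet appears in a single epoch, so only `a` and `b` qualify as persistent. A memory-constrained design such as the one the abstract describes would replace the per-item dictionary with a compact structure.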