LTC: A Fast Algorithm to Accurately Find Significant Items in Data Streams
Shiyu Cheng, Dongsheng Yang, Tong Yang, Haowei Zhang, Bin Cui
DOI: https://doi.org/10.1109/tkde.2020.3038911
IF: 9.235
2022-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Finding top-$k$ frequent items has long been a hot issue in databases. Finding top-$k$ persistent items is a newer issue that has attracted increasing attention in recent years. In practice, users often want to know which items are significant, i.e., not only frequent but also persistent. No prior art addresses both issues at the same time. Moreover, for high-speed data streams, prior art cannot achieve high accuracy when memory is tight. In this paper, we define a new issue, named finding significant items, and propose a novel algorithm, named LTC, to address it. LTC includes two key techniques, Long-tail Restoring and CLOCK, as well as three optimizations. In addition, LTC is extended to support finding significant items with thresholds. We theoretically derive the correct rate and error bound, and conduct extensive experiments on three real datasets to test the performance of LTC. Our experimental results show that LTC can achieve $10^5$ times higher accuracy in terms of average relative error than other related algorithms. Lastly, we apply LTC to a DDoS detection task, and the results show that finding significant items is more powerful than finding frequent items.
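To make the notion of a significant item concrete, the following is a minimal sketch of an exact, memory-unbounded baseline; it is not the LTC algorithm and uses neither Long-tail Restoring nor CLOCK. The abstract does not state the formal definition, so the sketch assumes, for illustration only, that the stream is split into fixed-length periods, that persistency is the number of periods in which an item appears, and that significance is a weighted sum of frequency and persistency. The function name `significant_items` and the parameters `period_len`, `alpha`, and `beta` are hypothetical.

```python
from collections import Counter


def significant_items(stream, period_len, k, alpha=1.0, beta=1.0):
    """Exact baseline (assumed definition, not LTC itself).

    Ranks items by alpha * frequency + beta * persistency, where
    persistency counts the periods in which the item appears at least once.
    """
    frequency = Counter()
    persistency = Counter()
    for start in range(0, len(stream), period_len):
        window = stream[start:start + period_len]
        frequency.update(window)         # every occurrence counts
        persistency.update(set(window))  # at most one count per period
    score = {
        item: alpha * frequency[item] + beta * persistency[item]
        for item in frequency
    }
    # Highest score first; ties broken by item name for determinism.
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:k]


if __name__ == "__main__":
    # "a" is bursty (frequent, few periods); "b" is steady (fewer hits, more periods).
    stream = ["a", "a", "a", "a", "a", "b", "b", "c", "d", "b", "e", "f"]
    print(significant_items(stream, period_len=3, k=2))
    print(significant_items(stream, period_len=3, k=1, alpha=1.0, beta=3.0))
```

Under these assumptions, raising `beta` relative to `alpha` favors the steady item over the bursty one, which is the distinction between frequent and persistent items that the abstract draws. An exact baseline like this stores a counter per distinct item, which is the memory cost a compact streaming structure is meant to avoid.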