Finding Significant Items in Data Streams.

Tong Yang, Haowei Zhang, Dongsheng Yang, Yucheng Huang, Xiaoming Li
DOI: https://doi.org/10.1109/icde.2019.00126
2019-01-01
Abstract: Finding top-k frequent items has long been a hot issue in databases. Finding top-k persistent items is a newer issue that has attracted increasing attention in recent years. In practice, users often want to know which items are significant, i.e., not only frequent but also persistent. No prior work addresses both issues at the same time, and on high-speed data streams existing algorithms cannot achieve high accuracy when memory is tight. In this paper, we define a new issue, finding top-k significant items, and propose a novel algorithm named LTC to address it. LTC includes two key techniques: Long-tail Replacement and a modified CLOCK algorithm. We theoretically prove that there is no overestimation error and derive the correct rate and error bound. We conduct extensive experiments on three real datasets. Our experimental results show that LTC achieves 300 to 10^8 times, and on average 10^5 times, higher accuracy than other related algorithms.