Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

Hong-Jie Guo, Jian-Zhong Li, Hong Gao
DOI: https://doi.org/10.1007/s11390-022-2429-6
2022-01-01
Abstract: A kernel is a data summary carefully extracted from a large dataset. For a given problem, the solution obtained from the kernel approximates the solution obtained from the whole dataset, with a provable approximation ratio. Kernels are widely used in geometric optimization, clustering, and approximate query processing to scale these tasks to massive data. In this paper, we focus on minimum ε-kernel (MK) computation, which asks for a kernel of the smallest size, for large-scale data processing. To address the open problem posed by Wang et al. of whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Because the MK problem is NP-hard in three or more dimensions, we develop an approximation algorithm, the Set Cover-Based Minimum ε-Kernel algorithm (SCMK), to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. We then discuss how to update the MK under insertions and deletions. Finally, a randomized algorithm, the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is used to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified experimentally on real-world and synthetic datasets. The kernels produced by SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. SCMK achieves a speedup of 5.67x over the ANN-based method on synthetic datasets, and RA-SCMK runs up to three times faster than SCMK on synthetic datasets.