Coresets and Sketches

Jeff M. Phillips
DOI: https://doi.org/10.48550/arXiv.1601.00617
2016-06-13
Abstract: Geometric data summarization has become an essential tool in both geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries whose results approximate those of the full data set. Coresets and sketches are the two most important classes of these summaries. We survey five types of coresets and sketches: shape-fitting, density estimation, high-dimensional vectors, high-dimensional point sets / matrices, and clustering.
Computational Geometry
What problem does this paper attempt to address?
This paper addresses how to compress and approximate geometric data efficiently in big-data settings, so that complex algorithms can run quickly on a summary while the accuracy of their results remains guaranteed. Specifically, it focuses on two classes of data summaries: **Coresets** and **Sketches**.

### Core Problems

1. **Coresets**: A coreset is a reduced data set that acts as a proxy for the complete data set. Running an algorithm on the coreset yields results close to those obtained by running it on the complete data set. The paper covers coresets for shape fitting, density estimation, high-dimensional vectors, high-dimensional point sets/matrices, and clustering.
2. **Sketches**: A sketch maps the complete data set onto an easily updatable data structure, so that certain queries on the sketch approximate the same queries on the complete data set. The paper discusses linear sketches, where the mapping is a linear function of each data point, which makes it easy to add, delete, or modify data.

### Specific Objectives

- **Shape Fitting**: Find the shape that best fits a given point set, for example via the minimum enclosing ball and ε-kernel coresets.
- **Density Estimation**: Select a subset of the points, viewed as a discrete density function, whose density is close to that of the original data set under a specific metric.
- **High-Dimensional Vectors**: Approximate the frequency counts and frequency moments of high-dimensional vectors.
- **High-Dimensional Point Sets/Matrices**: Compute low-rank approximations of high-dimensional point sets or matrices, especially for streaming data and distributed computing environments.
- **Clustering**: Use coresets and sketches in clustering problems to reduce computational complexity.
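
To make the idea of a linear sketch for frequency counts concrete, here is a minimal Python sketch of a Count-Min sketch, one well-known linear sketch. It is chosen here for illustration and is not taken from the summary above. The class name, the hash family, and the parameters `width`, `depth`, and `seed` are assumptions of this example. Because every update adds a value to fixed table cells, insertions, deletions, and merging two sketches all reduce to addition.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min sketch: a linear summary of item frequencies."""

    PRIME = (1 << 61) - 1  # modulus for the (assumed) hash family

    def __init__(self, width, depth, seed=0):
        self.width, self.depth, self.seed = width, depth, seed
        rng = random.Random(seed)
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod width.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        x = hash(item) % self.PRIME
        return ((a * x + b) % self.PRIME) % self.width

    def update(self, item, delta=1):
        """Add delta occurrences of item (a negative delta deletes)."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += delta

    def estimate(self, item):
        """Estimated count; never below the true count if counts stay >= 0."""
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        """Linearity: the sketch of combined data is the cell-wise sum."""
        if self.params != other.params or self.width != other.width:
            raise ValueError("sketches must share width, depth, and seed")
        merged = CountMinSketch(self.width, self.depth, self.seed)
        for r in range(self.depth):
            merged.table[r] = [
                x + y for x, y in zip(self.table[r], other.table[r])
            ]
        return merged


# Usage: two sites sketch their own data, then combine the sketches.
site_a = CountMinSketch(width=64, depth=4, seed=1)
site_b = CountMinSketch(width=64, depth=4, seed=1)
for item in [1, 2, 2, 3]:
    site_a.update(item)
for item in [2, 5]:
    site_b.update(item)
combined = site_a.merge(site_b)
print(combined.estimate(2))  # an upper bound on the true count of 3
```

A wider table lowers the chance of collisions and a deeper one lowers the chance that every row overestimates, so both trade space for accuracy.
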
### Technical Challenges

- **Space Efficiency**: How to store coresets and sketches in limited space, especially for streaming data and distributed computing environments.
- **Time Efficiency**: How to construct and update coresets and sketches in limited time, especially on large-scale data sets.
- **Error Control**: How to ensure that the error between results computed on the coreset or sketch and results on the complete data set stays within an acceptable range.

### Methodology

- **Random Sampling**: Construct coresets and sketches through random sampling, and use tools such as VC-dimension to bound the approximation error.
- **Merge-Reduce Framework**: In streaming and distributed settings, construct coresets and sketches efficiently by repeatedly merging summaries and reducing the result.
- **Linear Projection**: Use the Johnson-Lindenstrauss lemma to reduce the dimension of the data through random projection while approximately preserving pairwise distances.

### Application Scenarios

- **Machine Learning**: Train models on large-scale data sets while reducing the consumption of computing resources.
- **Data Mining**: Quickly discover patterns and trends in large data sets.
- **Graphics Processing**: Perform shape fitting and clustering analysis on high-dimensional data sets.

Overall, the paper surveys coresets and sketches as efficient, accuracy-controlled data summaries for meeting the computational challenges of big data.
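
As an illustration of the merge-reduce framework listed under Methodology, the following Python sketch summarizes a stream using uniform random sampling as the reduce step. It is a simplified example under stated assumptions, not the paper's construction: the function names, the `block_size` parameter, and the choice of a random half-sample as the reduce step are all hypothetical. Summaries behave like digits in a binary counter: two summaries at the same level are merged, halved, and promoted, and each surviving point at level `i` stands for `2**i` original points.

```python
import random


def reduce_half(points, rng):
    """Reduce step (assumed): keep a uniform random half of the points."""
    return rng.sample(points, len(points) // 2)


def merge_reduce(stream, block_size, seed=0):
    """Summarize a stream; returns a list of (weight, points) summaries."""
    rng = random.Random(seed)
    levels = {}  # level -> summary; each point there has weight 2**level

    def insert(summary, level):
        # Carry upward like binary addition while the level is occupied.
        while level in levels:
            merged = levels.pop(level) + summary   # merge two summaries
            summary = reduce_half(merged, rng)     # reduce back to block_size
            level += 1
        levels[level] = summary

    block = []
    for x in stream:
        block.append(x)
        if len(block) == block_size:
            insert(block, 0)
            block = []

    result = [(2 ** lvl, pts) for lvl, pts in sorted(levels.items())]
    if block:  # leftover partial block, kept exactly with weight 1
        result.append((1, block))
    return result


def weighted_mean(summaries):
    """Estimate the stream mean from the weighted summaries."""
    total = sum(w * len(pts) for w, pts in summaries)
    return sum(w * x for w, pts in summaries for x in pts) / total


# Usage: summarize 1000 values with blocks of 10 points.
summaries = merge_reduce(range(1000), block_size=10)
print(len(summaries), weighted_mean(summaries))
```

Only a logarithmic number of summaries is held at any time, and the total weight always equals the number of points seen, which is what lets weighted estimates stand in for the full stream. A reduce step with stronger guarantees can be swapped in for the random half-sample without changing the surrounding structure.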