Abstract: Computer clusters with a shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies for speeding up computation on big data and increasing scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is less efficient than block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce representative samples for approximate computing on big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both computational and statistical respects.
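As a rough illustration of the techniques named in the abstract, the following minimal Python sketch shows the three classical horizontal partitioning schemes (range, hash, and random) together with reservoir sampling. It is an in-memory toy, not the implementation of any framework covered by the survey; the function names, the `boundaries` and `seed` parameters, and the use of CRC32 as the hash function are all assumptions made for this example.

```python
import random
import zlib
from bisect import bisect_right


def range_partition(records, boundaries, key=lambda r: r):
    """Place each record in the block whose key range contains it.

    `boundaries` is a sorted list of split points (hypothetical parameter);
    n split points yield n + 1 blocks.
    """
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        blocks[bisect_right(boundaries, key(r))].append(r)
    return blocks


def hash_partition(records, num_blocks, key=lambda r: r):
    """Place each record in the block selected by a hash of its key."""
    blocks = [[] for _ in range(num_blocks)]
    for r in records:
        # CRC32 is an assumed choice: it is deterministic across runs.
        h = zlib.crc32(repr(key(r)).encode("utf-8"))
        blocks[h % num_blocks].append(r)
    return blocks


def random_partition(records, num_blocks, seed=0):
    """Shuffle the records, then deal them round-robin into blocks."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def reservoir_sample(stream, k, seed=0):
    """Draw a uniform sample of k items from a stream in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


data = list(range(100))
range_blocks = range_partition(data, [25, 50, 75])
hash_blocks = hash_partition(data, 4)
random_blocks = random_partition(data, 4)
sample = reservoir_sample(iter(data), 10)
```

In this toy setting, a block taken from `range_blocks` covers only one slice of the key space, whereas each block in `random_blocks` is itself a random subset of the data. That difference is one way to see why block-level sampling depends on how the blocks were produced.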


A survey of data partitioning and sampling methods to support big data analysis
