Abstract: Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
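
The point that block-level sampling depends on how the blocks were produced can be illustrated with a small sketch. It is not taken from the survey: the function names (`range_partition`, `hash_partition`, `random_partition`, `reservoir_sample`, `block_mean_errors`) are hypothetical, in-memory Python lists stand in for distributed data blocks, and `random_partition` is only a loose illustration of random partitioning, not the RSP model itself.

```python
import random
from statistics import mean


def range_partition(records, num_blocks):
    """Sort records and cut them into contiguous ranges of about equal size.

    Assumption: len(records) is comfortably larger than num_blocks; for tiny
    inputs this simple slicing may return fewer than num_blocks blocks.
    """
    ordered = sorted(records)
    size = -(-len(ordered) // num_blocks)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def hash_partition(records, num_blocks):
    """Assign each record to the block given by its hash modulo num_blocks."""
    blocks = [[] for _ in range(num_blocks)]
    for record in records:
        blocks[hash(record) % num_blocks].append(record)
    return blocks


def random_partition(records, num_blocks, rng):
    """Shuffle the records, then deal them round-robin into num_blocks blocks."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def reservoir_sample(stream, k, rng):
    """Keep a uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def block_mean_errors(records, num_blocks, seed=0):
    """Use one block as a block-level sample and compare its mean to the full mean.

    Returns (error for a range-partitioned block, error for a randomly
    partitioned block). The first block is used in both cases for simplicity.
    """
    rng = random.Random(seed)
    overall = mean(records)
    range_block = range_partition(records, num_blocks)[0]
    random_block = random_partition(records, num_blocks, rng)[0]
    return abs(mean(range_block) - overall), abs(mean(random_block) - overall)
```

Calling `block_mean_errors(list(range(10_000)), 10)` takes a single block as the sample in each case. The range-partitioned block holds only the smallest values, so its mean is far from the overall mean, while the randomly partitioned block gives a much closer estimate.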

A survey of data partitioning and sampling methods to support big data analysis