Abstract: Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop makes it unexpectedly difficult to run diverse workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that still performs poorly, either because they do not know how these parameters influence performance or because they are not even aware that the parameters exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
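As a rough illustration of the kind of pipeline the abstract describes, the sketch below standardizes a small metric matrix, extracts its first principal component, and fits a one-variable regression of run time on the component score. It is a minimal sketch under stated assumptions, not the paper's method: the data are synthetic, the three metrics stand in for the 45 used in the paper, power iteration replaces a full principal component analysis, cluster analysis is omitted, and every function and variable name is hypothetical.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def first_component(z, iters=200):
    """Leading eigenvector of the covariance matrix via power iteration.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    n, d = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


# Synthetic runs (assumption): one hidden "load" factor drives three
# made-up metrics and the job run time in seconds.
rng = random.Random(0)
metrics, runtimes = [], []
for _ in range(30):
    load = rng.uniform(0.0, 1.0)
    metrics.append([
        load + rng.gauss(0, 0.05),        # hypothetical CPU utilization
        2 * load + rng.gauss(0, 0.05),    # hypothetical shuffle volume
        -load + rng.gauss(0, 0.05),       # hypothetical idle-time share
    ])
    runtimes.append(100 + 50 * load + rng.gauss(0, 1.0))

z = standardize(metrics)
pc1 = first_component(z)
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]
slope, intercept = fit_line(scores, runtimes)
print("PC1 loadings:", [round(x, 3) for x in pc1])
print("runtime ~ %.2f * score + %.2f" % (slope, intercept))
```

In this toy setup the first component mostly recovers the hidden load factor, so a single regression term is enough; real Hadoop measurements would likely need several components and configuration parameters as additional predictors.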

MapReduce Workload Modeling with Statistical Approach
