Abstract: MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 in performing a variety of analytical tasks. MapReduce can achieve better performance when more compute nodes are allocated from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users want an economical, elastically scalable data processing system and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], making it more comparable to that of parallel database systems. Our results therefore show that it is possible to build a cloud data processing system that is both elastically scalable and efficient.

A Comparative Study of Data Skew in Hadoop

Join Query Optimization Based on MapReduce under Skewed Data

SkewControl: Gini out of the Bottle

LIBRA: Lightweight Data Skew Mitigation in MapReduce

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

TS-Hadoop: Handling Access Skew in MapReduce by Using Tiered Storage Infrastructure

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

The performance of MapReduce: an in-depth study

A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

The Performance of MapReduce

OPTIMIZATION FOR SPARK MISSION PERFORMANCE BASED ON DATA CHARACTERISTICS

An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management

SP-Partitioner: A novel partition method to handle intermediate data skew in spark streaming

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Efficient, Scalable and Robust Data Shuffle Service for Distributed MapReduce Computing on Cloud

Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution

Discussion On Fast And Accurate Sketches For Skewed Data Streams: A Case Study

DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams

A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture

Performance optimization of computing task scheduling based on the Hadoop big data platform

Comparative analysis of Spark and Hadoop through Imputation of Data on Big Datasets