Abstract: MapReduce has been widely used for large-scale data analysis in the cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 in performing a variety of analytical tasks. MapReduce can achieve better performance when more compute nodes are allocated from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical, elastically scalable data processing system and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], making it more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.

An experimental approach towards big data for analyzing memory utilization on a hadoop cluster using HDFS and MapReduce

An Approach in Big Data Analytics to Improve the Velocity of Unstructured Data Using MapReduce

Distributed data management using MapReduce

Data Management Techniques in Hadoop Framework for Handling Small Files: A Survey

Hadoop Distributed File System for Big data analysis

A Proposed Approach for Improving Hadoop Performance for Handling Small Files

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Unstructured Data Analysis on Big Data Using Map Reduce

A Survey on Geographically Distributed Big-Data Processing using MapReduce

The performance of MapReduce: an in-depth study

Data Processing Framework Using Apache and Spark Technologies in Big Data

Hadoop, MapReduce and HDFS: A Developers Perspective

Map Reduce for big data processing based on traffic aware partition and aggregation

The Performance of MapReduce

Past, Present and Future of Hadoop: A Survey

Performance evaluation of Map-reduce jar pig hive and spark with machine learning using big data

Massive Image Data Management Using Hbase And Mapreduce

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Large-scale Data Modelling in Hive and Distributed Query Processing using MapReduce and Tez

Column-Oriented Storage Techniques for MapReduce

Storage-Optimization Method for Massive Small Files of Agricultural Resources Based on Hadoop