Scalability Evaluation of Cluster Size for MapReduce Applications in Elastic Compute Clouds

Fan Zhang, Majd F Sakr
DOI: https://doi.org/10.5339/qfarf.2012.csp38
2012-01-01
Abstract: The MapReduce programming model is a widely accepted solution for the rapidly growing demands of so-called big-data processing. Various MapReduce applications with huge volumes of input data can run on an elastic compute cloud composed of many computing instances. Such an elastic compute cloud is best represented by a virtual cluster, such as Amazon EC2. Performance prediction of MapReduce applications would help in understanding their scalability pattern. However, it is challenging due to the complex interaction between the MapReduce framework and the underlying highly parameterized virtualized resources. Furthermore, MapReduce's high-dimensional space of configuration parameters adds to the prediction complexity. We have evaluated a series of representative MapReduce applications on Amazon EC2 and identified how the cluster size affects their execution times. The scaling curves of all applications are studied to discover the scalability pattern. Our major findings are as follows: (1) the execution times of MapReduce applications follow a power-law distribution; (2) for map-intensive applications, the power-law scalability starts from a small cluster size; and (3) for reduce-intensive applications, the power-law scalability starts from a larger cluster size. We attempted to fit our scalability performance results using three regression methods: polynomial regression, exponential regression, and power regression. Measured by the Root Mean Squared Error (RMSE), power regression performs best at performance prediction among the methods evaluated. This was the case across all the benchmark applications studied. Our performance prediction methods will aid cloud users in choosing appropriate computing resources, both virtual and physical, based on small-scale experimental test runs, for cost savings.
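The comparison of polynomial, exponential, and power regression by RMSE can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the cluster sizes and execution times are synthetic values generated from an exact power law (they are not the paper's measurements), the polynomial degree of 2 is arbitrary, and the function names are hypothetical. The exponential and power models are fitted by least squares in log space, which is one common approach and not necessarily the one the authors used.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def fit_polynomial(n, t, degree=2):
    """Fit t = c_0 * n^degree + ... + c_degree by ordinary least squares."""
    coeffs = np.polyfit(n, t, degree)
    return lambda x: np.polyval(coeffs, np.asarray(x, dtype=float))


def fit_exponential(n, t):
    """Fit t = a * exp(b * n) via a linear fit of ln(t) against n."""
    b, ln_a = np.polyfit(n, np.log(t), 1)
    return lambda x: np.exp(ln_a) * np.exp(b * np.asarray(x, dtype=float))


def fit_power(n, t):
    """Fit t = a * n^b via a linear fit of ln(t) against ln(n)."""
    b, ln_a = np.polyfit(np.log(n), np.log(t), 1)
    return lambda x: np.exp(ln_a) * np.asarray(x, dtype=float) ** b


# Synthetic, hypothetical data: execution time (seconds) versus cluster size,
# generated from an exact power law purely to exercise the code.
cluster_sizes = np.array([1, 2, 4, 8, 16, 32], dtype=float)
exec_times = 1200.0 * cluster_sizes ** -0.8

models = {
    "polynomial": fit_polynomial(cluster_sizes, exec_times),
    "exponential": fit_exponential(cluster_sizes, exec_times),
    "power": fit_power(cluster_sizes, exec_times),
}

# RMSE of each fitted model on the same points it was fitted to.
errors = {name: rmse(exec_times, model(cluster_sizes)) for name, model in models.items()}
best_model = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name:12s} RMSE = {err:.4f}")
print("lowest RMSE:", best_model)
```

Because the synthetic data is itself a power law, the power fit wins here by construction; with real measurements, the RMSE would preferably be computed on held-out cluster sizes to check how well each model extrapolates from small-scale test runs.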