Measuring the Optimality of Hadoop Optimization

Woo-Cheol Kim, Changryong Baek, Dongwon Lee
DOI: https://doi.org/10.48550/arXiv.1307.2915
2013-07-10
Distributed, Parallel, and Cluster Computing
Abstract: In recent years, much research has focused on how to optimize Hadoop jobs. The approaches are diverse, ranging from improving HDFS and the Hadoop job scheduler to optimizing parameters in Hadoop configurations. Despite their success in improving the performance of Hadoop jobs, however, very little is known about the limits of these optimizations. That is, how optimal is a given Hadoop optimization? When a Hadoop optimization method X improves the performance of a job by Y%, how do we know whether this improvement is as good as it can be? To answer this question, we first examine the ideal best case, the lower bound, of running time for Hadoop jobs and develop a measure to accurately estimate how optimal a given Hadoop optimization is with respect to that lower bound. Then, we demonstrate how one may exploit the proposed measure to improve the optimization of Hadoop jobs.