Diagnosing Virtualized Hadoop Performance from Benchmark Results: An Exploratory Study

Jun Fan, Xinhui Li, Chi Harold Liu, Jeffrey Buell, Gavin Lu, Luke Lu
DOI: https://doi.org/10.1109/BigData.Congress.2014.89
2014-01-01
Abstract: Hadoop is emerging as one of the leading frameworks used by enterprises to make better business decisions on large data sets. Virtualization technology brings many benefits to Hadoop, including higher resource utilization and cluster reliability. However, these benefits mean nothing to users if moving from a physical to a virtual platform causes unacceptable performance degradation. Existing work on virtualized Hadoop performance finds that improper network and storage configurations in open-source virtual deployments impose large overhead on system performance. At the same time, the complexity of the hardware and software stack, including virtualization configurations and varying deployment scales, still makes performance tuning too difficult to carry out in practice. To bridge this gap in virtualized Hadoop adoption, in this paper we propose a performance diagnostic methodology that integrates statistical analysis from different layers, and we design a heuristic performance diagnostic tool that evaluates the validity and correctness of virtualized Hadoop by analyzing the job traces of popular big data benchmarks. With this tool, users can quickly identify the bottleneck from the hints it provides, confirm the diagnosis by referring to performance utilities provided by the guest OS and hypervisor, and continue tuning the performance of virtualized Hadoop through multiple runs of the tool.