A two steps method of resources utilization predication for large Hadoop data center

Lei Yu, Fei Teng, Shangming Ning, Yunshu Li, Zhe Cui, Shengdong Du
DOI: https://doi.org/10.1002/cpe.5634
2020-01-01
Abstract: As data processing demands and Hadoop data center construction grow, the performance of Hadoop data centers is often limited by inappropriate resource utilization. This paper introduces a new method for predicting resource utilization in large-scale Hadoop clusters. The method adopts a two-step model: performance simulation of Hadoop applications, followed by resource utilization prediction. For the first step, a new simulator that integrates a baseline test with a multilayered network model is introduced and implemented. For the second step, a resource utilization predictor is proposed. By analyzing the patterns of resource utilization, a single-task model is proposed; a parallel-batch-task-based (PBT) model, which represents the behavior of real Hadoop applications by integrating the single-task model, is then introduced. Two test scenarios are configured to verify the method. In the data center scenario, Terasort, Wordcount, and Hive are selected as benchmarks; in the virtual machine scenario, Terasort is used as the benchmark. The experiments show that the error between the simulator results and the results from the experimental environment is less than 10% in most cases. The results confirm that the method can locate the resource bottleneck of a Hadoop cluster and supports agile cluster configuration for applications with massive data.