ReLoca: Optimize Resource Allocation for Data-parallel Jobs Using Deep Learning

Zhiyao Hu, Dongsheng Li, Dongxiang Zhang, Yixin Chen
DOI: https://doi.org/10.1109/infocom41043.2020.9155521
2020-01-01
Abstract: Under-allocating computation resources (e.g., CPU cores) leads to suboptimal job completion times (JCTs) for data-parallel jobs, so users tend to request excessive resources in the hope of reducing JCTs. However, over-allocating computation resources incurs considerable system overhead (e.g., network communication and disk I/O), which in turn prolongs the job completion time. In this paper, we propose ReLoca, which targets the optimal allocation of computation resources with the objective of minimizing the job completion time. ReLoca employs a deep neural network to guide resource allocation by learning how the operations in a data-parallel job affect system overhead and computation time. Because training samples are time-consuming to collect, we develop an adaptive sampling method that preferentially collects high-quality samples and thus mitigates data scarcity. We apply ReLoca to Spark and conduct real experiments with five typical big data analytics applications. Results show that ReLoca significantly reduces the average job completion time. Compared with the state-of-the-art method, ReLoca achieves higher prediction accuracy, needs fewer training samples, and incurs lower sampling overhead. With ReLoca's predictions, the JCT decreases by 29.85%.