Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

Zhiyao Hu, Dongsheng Li, Dongxiang Zhang, Yiming Zhang, Baoyun Peng
DOI: https://doi.org/10.1109/tpds.2021.3055019
IF: 5.3
2021-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Under-allocating or over-allocating computation resources (e.g., CPU cores) can prolong the completion time of data-parallel jobs in a distributed system. We present a predictor, ReLocag, to find the near-optimal number of CPU cores to minimize job completion time (JCT). ReLocag includes a graph convolutional network (GCN) and a fully-connected network (FCNN). The GCN learns the dependency between operations from the workflow of a job, and then the FCNN takes the workflow dependency together with other features (e.g., the input size, the number of CPU cores, the amount of memory, and the number of computation tasks) as input for JCT prediction. The prediction result can guide the user to determine the near-optimal number of CPU cores. Besides, we propose two effective strategies to overcome the time-consuming issue of training sample collection in big data applications. First, we develop an adaptive sampling method to collect essential samples judiciously. Second, we further design a cross-application transfer learning model to exploit the training samples collected from other applications. We conduct extensive experiments in a Spark cluster for 7 types of exemplary Spark applications. Results show that ReLocag improves the JCT prediction accuracy by 4-14 percent. Moreover, the CPU core consumption decreases by 58.2 percent.
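The pipeline described in the abstract (a GCN encodes the workflow, an FCNN maps that encoding plus job features to a predicted JCT, and the prediction guides the choice of core count) can be illustrated with a minimal sketch. This is not the authors' implementation. Every function name, the mean-pooling step, the single linear layer standing in for the FCNN, the undirected treatment of workflow edges, the 5 percent tolerance, and the toy JCT curve are assumptions made for illustration only.

```python
import math

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj    : n x n adjacency matrix of the workflow (treated as undirected here)
    feats  : n x f per-operation feature rows (H)
    weight : f x o weight matrix (W), fixed here instead of learned
    """
    n = len(adj)
    # Add self-loops so each operation keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    n_in, n_out = len(weight), len(weight[0])
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]

def graph_embedding(node_feats):
    """Mean-pool node vectors into one workflow-level vector (an assumption)."""
    n = len(node_feats)
    return [sum(row[c] for row in node_feats) / n
            for c in range(len(node_feats[0]))]

def predict_jct(embedding, job_feats, weights, bias):
    """Stand-in for the FCNN: one linear layer over the concatenated input."""
    x = embedding + job_feats  # workflow encoding followed by job features
    return sum(w * v for w, v in zip(weights, x)) + bias

def pick_cores(predict, candidates, tolerance=0.05):
    """Smallest core count whose predicted JCT is within `tolerance` of the best."""
    preds = {c: predict(c) for c in candidates}
    best = min(preds.values())
    return min(c for c, t in preds.items() if t <= best * (1 + tolerance))

def toy_jct(cores):
    """Hypothetical JCT curve: parallel work shrinks, coordination overhead grows."""
    return 120.0 / cores + 2.0 * cores

# Hypothetical three-operation workflow: map -> shuffle -> reduce.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden = gcn_layer(adj, one_hot, identity)
workflow_vec = graph_embedding(hidden)
chosen = pick_cores(toy_jct, [1, 2, 4, 8, 16, 32])
```

In a trained system, `toy_jct` would be replaced by a call that feeds `workflow_vec` and the job features, including the candidate core count, through the learned network. Picking the smallest core count within a tolerance of the best prediction is one way to trade a small JCT increase for lower core consumption; the abstract does not specify the exact selection rule.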