Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning

Tiangang Li,Shi Ying,Yishi Zhao,Jianga Shang
DOI: https://doi.org/10.1109/tpds.2023.3334519
IF: 5.3
2023-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract:In cloud computing, how to reasonably allocate computing resources for batch jobs to ensure the load balance of dynamic clusters and meet user requests is an important and challenging task. Most existing studies are based on deep Q network, which utilizes neural networks to estimate the expected value of cumulative return in the scheduling process. The value-based DQN algorithms ignore the complete information contained in the value distribution and lack strong adaptability to time-varying batch jobs and dynamic cluster resources. Therefore, to capture the inherent stochasticity of the scheduling process caused by environmental stochasticity, we utilize Distributional Reinforcement Learning to model the value distribution of the cumulative return. Specifically, we formalize the load balancing scheduling as a multi-objective optimization problem and construct a Distributional Reinforcement Learning model. Then we introduce quantile regression to learn the value distribution of the cumulative return during scheduling and propose a dynamic load balancing scheduling algorithm based on Distributional Reinforcement Learning. In addition, we develop a cluster environment for real-time processing of batch jobs to simulate the arrival of batch jobs and train the Distributional Reinforcement Learning-based scheduling agent. We conduct empirical experiments and detailed analysis by using the real Alibaba Cluster cluster traces v2018 and v2020. The results show that compared to the baseline algorithms, the proposed algorithm performs better in terms of cluster load balancing, success rate of instance creation and average completion time of the tasks. The experimental results on different trace datasets also indicate that the propsoed algorithm exhibits excellent scalability.
What problem does this paper attempt to address?