Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review

Yan Gu, Zhaoze Liu, Shuhong Dai, Cong Liu, Ying Wang, Shen Wang, Georgios Theodoropoulos, Long Cheng
2025-01-02
Abstract: Cloud computing has revolutionized the provisioning of computing resources, offering scalable, flexible, and on-demand services to meet the diverse requirements of modern applications. At the heart of efficient cloud operations are job scheduling and resource management, which are critical for optimizing system performance and ensuring timely and cost-effective service delivery. However, the dynamic and heterogeneous nature of cloud environments presents significant challenges for these tasks, as workloads and resource availability can fluctuate unpredictably. Traditional approaches, including heuristic and meta-heuristic algorithms, often struggle to adapt to these real-time changes due to their reliance on static models or predefined rules. Deep Reinforcement Learning (DRL) has emerged as a promising solution to these challenges by enabling systems to learn and adapt policies based on continuous observations of the environment, facilitating intelligent and responsive decision-making. This survey provides a comprehensive review of DRL-based algorithms for job scheduling and resource management in cloud computing, analyzing their methodologies, performance metrics, and practical applications. We also highlight emerging trends and future research directions, offering valuable insights into leveraging DRL to advance both job scheduling and resource management in cloud computing.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulty that traditional job scheduling and resource management methods have in coping with dynamic, heterogeneous cloud environments. Specifically:

1. **Dynamic and Heterogeneous Environments**: Workloads and resource availability in the cloud fluctuate unpredictably, so methods based on static models or predefined rules struggle to adapt to real-time changes.
2. **Limitations of Traditional Methods**: Traditional heuristic and meta-heuristic algorithms (such as genetic algorithms and whale optimization algorithms) rely on prior knowledge and static optimization models, and perform poorly when task arrival times and resource requirements change rapidly.
3. **Optimizing System Performance and Ensuring Service Quality**: Effective job scheduling and resource management are crucial for optimizing system performance and for timely, cost-effective service delivery, so a solution that can respond and adapt intelligently to these changes is required.

To address these problems, the paper examines Deep Reinforcement Learning (DRL) as a promising solution. DRL learns and adapts policies through continuous interaction with the environment, which enables intelligent and responsive decision-making. This is reflected in the following aspects:

- **High Adaptability**: DRL can adjust its policy dynamically, based on continuous observations of the environment, to handle unpredictable changes in workload and resources.
- **Optimizing Resource Utilization**: By learning an effective policy, DRL can improve resource utilization, system performance, and Quality of Service (QoS).
- **Reducing Operating Costs**: DRL helps minimize operating costs while still meeting the requirements of Service-Level Agreements (SLAs).
In conclusion, this paper aims to explore how to use DRL technology to improve job scheduling and resource management in the cloud computing environment to meet the challenges brought by dynamic and complex environments.
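
The idea of learning a scheduling policy through interaction with the environment can be illustrated with a toy sketch. It is not taken from the paper, and it is a simplification: it uses tabular Q-learning, where a lookup table stands in for the deep network that DRL methods use. The two machines, their speeds, the job sizes, the reward (a proxy for job completion time), and all function names are hypothetical assumptions chosen only for illustration.

```python
import random
from collections import defaultdict

SPEEDS = (2, 1)      # hypothetical: work units each machine finishes per time step
MAX_BACKLOG = 8      # backlog is capped so the state space stays small


def step(state, action, job_size):
    """Place one job on machine `action`, then advance time by one step.

    Returns (next_state, reward). The reward is the negative of the chosen
    machine's (capped) backlog divided by its speed, a rough proxy for how
    long the new job waits before it completes.
    """
    backlog = list(state)
    backlog[action] = min(backlog[action] + job_size, MAX_BACKLOG)
    reward = -backlog[action] / SPEEDS[action]
    next_state = tuple(max(0, b - s) for b, s in zip(backlog, SPEEDS))
    return next_state, reward


def train(episodes=200, horizon=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0, 0.0])   # state -> one value per machine
    for _ in range(episodes):
        state = (0, 0)                    # both machines start idle
        for _ in range(horizon):
            if rng.random() < epsilon:
                action = rng.randrange(2)                         # explore
            else:
                action = max((0, 1), key=lambda a: q[state][a])   # exploit
            # A job of random size 1..3 arrives at every step (assumption).
            next_state, reward = step(state, action, rng.randint(1, 3))
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_action(q, state):
    """Pick the machine with the highest learned value for this state."""
    return max((0, 1), key=lambda a: q[state][a])
```

After `q = train()`, a call such as `greedy_action(q, (0, 5))` returns the machine the learned table prefers when machine 0 is idle and machine 1 has five units of backlog. A DRL scheduler would replace the table with a neural network so that it can generalize across far larger state spaces than this sketch can enumerate.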