Abstract: This paper addresses the need for advanced techniques for continuously allocating workloads on shared data center infrastructure, a problem driven by the growing popularity and scale of cloud computing. In particular, it highlights the scarcity of research on guaranteeing capacity for capacity reservations during large-scale failures. To tackle these issues, the paper presents scalable solutions for resource management. It builds on prior work that established capacity reservation in cluster management systems and on the two-level resource allocation problem addressed by the Resource Allowance System (RAS). Recognizing the limitations of Mixed Integer Linear Programming (MILP) for server assignment in a dynamic environment, the paper proposes the use of Deep Reinforcement Learning (DRL), which has been successful in achieving long-term optimal results for time-varying systems. Because directly applying DRL algorithms to large-scale instances with millions of decision variables is impractical, a novel two-level design that utilizes a DRL-based algorithm is introduced to solve the optimal server-to-reservation assignment, taking into account fault tolerance, server movement minimization, and network affinity requirements. The paper explores the interconnection of these levels and the benefits of such an approach for achieving long-term optimal results in large-scale cloud systems. We further show in the experiment section that our two-level DRL approach outperforms the MIP solver and heuristic approaches and requires significantly less computation time than the MIP solver. Specifically, our two-level DRL approach performs 15% better than the MIP solver at minimizing the overall cost. It also takes only 26 seconds to execute 30 rounds of decision making, while the MIP solver needs nearly an hour.

Deep Reinforcement Learning for Intelligent Cloud Resource Management

Intelligent Cloud Resource Management with Deep Reinforcement Learning.

Deep reinforcement learning-based methods for resource scheduling in cloud computing: a review and future directions

A Deep Reinforcement Learning-Based Model for Optimal Resource Allocation and Task Scheduling in Cloud Computing

Two-tiered Online Optimization of Region-wide Datacenter Resource Allocation via Deep Reinforcement Learning

Deep Reinforcement Learning for Online Resource Allocation in IoT Networks: Technology, Development, and Future Challenges

Cloud Resource Scheduling with Deep Reinforcement Learning and Imitation Learning

Energy efficient task scheduling based on deep reinforcement learning in cloud environment: A specialized review

Research on Cloud Computing Resources Provisioning Based on Reinforcement Learning

Deep Reinforcement Learning: Framework, Applications, and Embedded Implementations

Energy-aware systems for real-time job scheduling in cloud data centers: A deep reinforcement learning approach

DRL-Scheduling: an Intelligent QoS-Aware Job Scheduling Framework for Applications in Clouds

Resource Allocation with Workload-Time Windows for Cloud-Based Software Services: A Deep Reinforcement Learning Approach

Deep Reinforcement Learning Based Resource Allocation Strategy in Cloud-Edge Computing System

Deep reinforcement learning based resource allocation in edge-cloud gaming

Scheduling of decentralized robot services in cloud manufacturing with deep reinforcement learning

H2O-Cloud: A Resource and Quality of Service-Aware Task Scheduling Framework for Warehouse-Scale Data Centers -- A Hierarchical Hybrid DRL (Deep Reinforcement Learning) based Approach

A2C-DRL: Dynamic Scheduling for Stochastic Edge-Cloud Environments Using A2C and Deep Reinforcement Learning

A DRL-Driven Intelligent Optimization Strategy for Resource Allocation in Cloud-Edge-End Cooperation Environments

Deep-Reinforcement-Learning-Based Resource Allocation for Cloud Gaming via Edge Computing