Abstract: This paper addresses the need for advanced techniques for continuously allocating workloads on shared data center infrastructure, a problem driven by the growing popularity and scale of cloud computing. In particular, little prior research ensures that capacity reservations retain their guaranteed capacity during large-scale failures. To tackle these issues, the paper presents scalable solutions for resource management. It builds on prior work that established capacity reservation in cluster management systems and on the two-level resource allocation problem addressed by the Resource Allowance System (RAS). Recognizing the limitations of Mixed Integer Linear Programming (MILP) for server assignment in a dynamic environment, the paper proposes Deep Reinforcement Learning (DRL), which has been successful in achieving long-term optimal results for time-varying systems. Because directly applying DRL algorithms to large-scale instances with millions of decision variables is impractical, a novel two-level design built on a DRL-based algorithm is introduced to solve the optimal server-to-reservation assignment, taking into account fault tolerance, server movement minimization, and network affinity requirements. The paper explores how the two levels interconnect and the benefits of this approach for achieving long-term optimal results in large-scale cloud systems. Experiments show that the two-level DRL approach outperforms both the MIP solver and heuristic approaches, with significantly less computation time than the MIP solver. Specifically, it performs 15% better than the MIP solver at minimizing the overall cost, and it takes only 26 seconds to execute 30 rounds of decision making, while the MIP solver needs nearly an hour.

Elastic Task Offloading and Resource Allocation over Hybrid Cloud: A Reinforcement Learning Approach

Two-tiered Online Optimization of Region-wide Datacenter Resource Allocation via Deep Reinforcement Learning

Resource Management in Cloud Based on Deep Reinforcement Learning

Adaptive DRL-Based Task Scheduling for Energy-Efficient Cloud Computing

A Task Offloading and Resource Allocation Optimization Method in End-Edge-Cloud Orchestrated Computing

Task Offloading And Resource Scheduling In Hybrid Edge-Cloud Networks

H2O-Cloud: A Resource and Quality of Service-Aware Task Scheduling Framework for Warehouse-Scale Data Centers -- A Hierarchical Hybrid DRL (Deep Reinforcement Learning) based Approach

Task Scheduling and Resource Allocation Based on Ant-Colony Optimization and Deep Reinforcement Learning

Dynamic Job Scheduling On Scalable Cloud Resources

Long-Term Multi-objective Task Scheduling with Diff-Serv in Hybrid Clouds

An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud

Adaptive and Efficient Resource Allocation in Cloud Datacenters Using Actor-Critic Deep Reinforcement Learning

Hybrid Edge-Cloud Collaborator Resource Scheduling Approach Based on Deep Reinforcement Learning and Multiobjective Optimization

Efficient Resource Allocation Policy for Cloud Edge End Framework by Reinforcement Learning

Multi-Agent Deep Reinforcement Learning Offloading Algorithm Based on Resource Reservation and Task Adjustment

Dynamic Load Balancing in Cloud Computing: Optimized RL-Based Clustering with Multi-Objective Optimized Task Scheduling

Hybrid Deep Reinforcement Learning-Based Task Offloading for D2D-Assisted Cloud-Edge-Device Collaborative Networks

Job Scheduling in Hybrid Clouds With Privacy Constraints: A Deep Reinforcement Learning Approach

A Novel Two-Layered Reinforcement Learning for Task Offloading with Tradeoff Between Physical Machine Utilization Rate and Delay

Joint Task Assignment and Migration in Cloud-Edge-End Collaborative Computing Based on DRL

Intelligent Resource Allocation for Edge-Cloud Collaborative Networks: A Hybrid DDPG-D3QN Approach