Abstract: This paper addresses the need for advanced techniques for continuously allocating workloads on shared data center infrastructure, a problem driven by the growing popularity and scale of cloud computing. In particular, little research has addressed how to guarantee reserved capacity during large-scale failures. To tackle these issues, the paper presents scalable solutions for resource management. It builds on prior work that established capacity reservation in cluster management systems and on the two-level resource allocation problem addressed by the Resource Allowance System (RAS). Recognizing the limitations of Mixed Integer Linear Programming (MILP) for server assignment in a dynamic environment, the paper proposes using Deep Reinforcement Learning (DRL), which has been successful in achieving long-term optimal results for time-varying systems. Because directly applying DRL algorithms to large-scale instances with millions of decision variables is impractical, a novel two-level design built around a DRL-based algorithm is introduced to solve the optimal server-to-reservation assignment, taking into account fault tolerance, server movement minimization, and network affinity requirements. The paper explores how the two levels interconnect and the benefits of this approach for achieving long-term optimal results in large-scale cloud systems. We further show in the experiments section that our two-level DRL approach outperforms both the MIP solver and heuristic approaches, with significantly less computation time than the MIP solver. Specifically, our two-level DRL approach performs 15% better than the MIP solver at minimizing the overall cost. It also takes only 26 seconds to execute 30 rounds of decision making, whereas the MIP solver needs nearly an hour.

Reinforcement Learning for Optimal Load Distribution Sequencing in Resource-Sharing System

Optimal Divisible Load Scheduling for Resource-Sharing Network

Two-tiered Online Optimization of Region-wide Datacenter Resource Allocation via Deep Reinforcement Learning

Distributed Scheduling Strategies for Processing Multiple Divisible Loads with Unknown Network Resources

RLPTO : A reinforcement learning-based performance-time optimized task and resource scheduling mechanism for distributed machine learning

A distributed scheduling strategy for multiple divisible loads with unknown network resources

Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments

The Optimized Reinforcement Learning Approach to Run-Time Scheduling in Data Center.

Optimizing Load Scheduling in Power Grids Using Reinforcement Learning and Markov Decision Processes

Dynamic Offloading Loading Optimization in distributed Fault Diagnosis system with Deep Reinforcement Learning Approach

Energy-aware Task Scheduling Optimization with Deep Reinforcement Learning for Large-Scale Heterogeneous Systems

Optimal Energy System Scheduling Using A Constraint-Aware Reinforcement Learning Algorithm

A Deep Reinforcement Learning Approach for Optimal Scheduling of Heavy-haul Railway

A3C-DO: A Regional Resource Scheduling Framework Based on Deep Reinforcement Learning in Edge Scenario

Resource Allocation in Disaggregated Data Centre Systems with Reinforcement Learning

A2C-DRL: Dynamic Scheduling for Stochastic Edge-Cloud Environments Using A2C and Deep Reinforcement Learning

Periodic multi-installment algorithm for divisible load scheduling

Learning-based Two-tiered Online Optimization of Region-wide Datacenter Resource Allocation

Optimal dispatch of integrated energy system based on deep reinforcement learning

Deep Reinforcement Learning-Based Resource Allocation for Content Distribution in IoT-Edge-Cloud Computing Environments

An Optimization Model for Divisible-Load Scheduling Considering Processor Time-Window