Zeal: Rethinking Large-Scale Resource Allocation with "Decouple and Decompose"

Zhiying Xu, Francis Y. Yan, Minlan Yu
2024-12-16
Abstract: Resource allocation is fundamental for cloud systems to ensure efficient resource sharing among tenants. However, the scale of such optimization problems has outgrown the capabilities of commercial solvers traditionally employed in production. To scale up resource allocation, prior approaches either tailor solutions to specific problems or rely on assumptions tied to particular workloads. In this work, we revisit real-world resource allocation problems and uncover a common underlying structure: a vast majority of these problems are inherently separable, i.e., they optimize the aggregate utility of individual resource and demand allocations, under separate constraints for each resource and each demand. Building on this insight, we develop DeDe, a general, scalable, and theoretically grounded framework for accelerating resource allocation through a "decouple and decompose" approach. DeDe systematically decouples entangled resource and demand constraints, thereby decomposing the overall optimization into alternating per-resource and per-demand allocations, which can then be solved efficiently and in parallel. We have implemented DeDe as a library extension to an open-source solver, maintaining a familiar user interface. Experimental results across three prominent resource allocation tasks -- traffic engineering, cluster scheduling, and load balancing -- demonstrate DeDe's substantial speedups and robust allocation quality.
Distributed, Parallel, and Cluster Computing
### What problem does this paper attempt to solve?

This paper addresses the **scalability crisis** in large-scale resource allocation. As cloud computing environments expand and diversify, modern resource allocation problems have outgrown the capabilities of traditional commercial solvers. These problems may involve millions of variables, so a solver can take tens of minutes or even hours to compute a solution, whereas allocations must be produced quickly to maintain service quality.

#### Main challenges

1. **Limitations of existing solvers**: Traditional commercial optimization solvers (such as Gurobi) are inefficient on large-scale resource allocation problems and cannot meet real-time requirements.
2. **Limitations of existing methods**: Previous research either customized solutions for specific domains or objectives, or relied on assumptions about specific workloads, which makes those solutions hard to adapt to new scenarios.
3. **Complexity and coupling**: Resource constraints and demand constraints are intertwined, which makes it difficult to decompose the problem into smaller subproblems that can be solved in parallel.

#### Solutions

The authors propose a framework named **ZEAL** (called DeDe in the abstract above) that accelerates resource allocation through a "decouple and decompose" method. The core ideas of ZEAL are:

- **Discover the inherent separable structure**: An analysis of a large number of real-world resource allocation problems shows that the vast majority are inherently separable: they optimize the aggregate utility of individual resource and demand allocations, under separate constraints for each resource and each demand.
- **Decouple resource and demand constraints**: Introduce an auxiliary variable matrix \(z\) and apply the Alternating Direction Method of Multipliers (ADMM) to split the original problem into two subproblems, a resource-side allocation and a demand-side allocation, which are solved in alternation.
- **Decompose the optimization problem**: Break each of these subproblems into many smaller ones, each involving only a single resource or a single demand, so that they can be solved in parallel at large scale.

#### Experimental verification

The experimental results show that ZEAL delivers significant speedups and high-quality allocations on three typical resource allocation tasks (traffic engineering, cluster scheduling, and load balancing). Compared with state-of-the-art methods, ZEAL improves allocation quality by 5.3% to 14% and achieves speedups of between 1.34 and 7.6 times, depending on the task.

### Summary

ZEAL provides a general, scalable, and theoretically grounded framework for the scalability challenges of large-scale resource allocation. It applies to multiple application scenarios and can significantly reduce solving time while preserving allocation quality.
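
To make the decoupling step concrete, the following is a minimal sketch of ADMM applied to a toy separable allocation problem. It is not the authors' implementation. It assumes a linear utility, one capacity budget per resource (row sums of the allocation matrix) and one budget per demand (column sums); the function names `project_budget` and `decouple_and_decompose`, the penalty `rho`, and the iteration count are all hypothetical choices made for illustration. The matrix `x` obeys only the resource constraints, its copy `z` obeys only the demand constraints, and the scaled dual variable `u` pulls the two towards agreement.

```python
import numpy as np

def project_budget(v, budget):
    """Euclidean projection of v onto {y >= 0, sum(y) <= budget}."""
    clipped = np.maximum(v, 0.0)
    if clipped.sum() <= budget:
        return clipped
    # The budget is tight, so project onto the simplex {y >= 0, sum(y) = budget}
    # using the standard sort-and-threshold method.
    s = np.sort(v)[::-1]
    excess = np.cumsum(s) - budget
    k = np.arange(1, v.size + 1)
    active = s - excess / k > 0
    theta = excess[active][-1] / k[active][-1]
    return np.maximum(v - theta, 0.0)

def decouple_and_decompose(utility, capacity, demand_cap, rho=1.0, iters=200):
    """Toy ADMM loop (an assumption-laden sketch, not the paper's solver).

    Maximizes sum(utility * x) subject to row sums <= capacity,
    column sums <= demand_cap, and x >= 0.
    """
    n_res, n_dem = utility.shape
    z = np.zeros((n_res, n_dem))  # auxiliary copy holding the demand constraints
    u = np.zeros((n_res, n_dem))  # scaled dual variable for the constraint x = z
    for _ in range(iters):
        # Per-resource step: each row is an independent subproblem,
        # so these projections could run in parallel.
        target = z - u + utility / rho
        x = np.vstack([project_budget(target[i], capacity[i])
                       for i in range(n_res)])
        # Per-demand step: each column is an independent subproblem.
        target = x + u
        z = np.column_stack([project_budget(target[:, j], demand_cap[j])
                             for j in range(n_dem)])
        # Dual update penalizes disagreement between x and z.
        u += x - z
    return x, z

# Two resources, two demands; both resources individually prefer demand 0.
utility = np.array([[3.0, 2.0], [3.0, 1.0]])
capacity = np.array([1.0, 1.0])
demand_cap = np.array([1.0, 1.0])
x, z = decouple_and_decompose(utility, capacity, demand_cap)
print(np.round(x, 3))
```

On this toy instance, solving each resource in isolation would send both resources to demand 0 and violate its budget. The alternating updates instead settle on resource 0 serving demand 1 and resource 1 serving demand 0, for a total utility of 5. The list comprehensions stand in for the parallel per-resource and per-demand solves described above; a real system would dispatch them to separate workers.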