Balancing Job Performance with System Performance Via Locality-Aware Scheduling on Torus-Connected Systems

Xu Yang, Zhou, Wei Tang, Xingwu Zheng, Jia Wang, Zhiling Lan
DOI: https://doi.org/10.1109/cluster.2014.6968751
2014-01-01
Cluster Computing
Abstract: Torus-connected networks are widely used in modern supercomputers because their per-node cost scales linearly and their overall performance is competitive. The job scheduling system plays a critical role in the efficient use of supercomputers. As supercomputers continue to grow in size, a fundamental problem arises: how to effectively balance job performance with system performance on torus-connected machines? In this work, we present a new scheduling design named window-based locality-aware scheduling. Our design contains three novel features. First, rather than scheduling jobs one by one, our design takes a "window" of jobs, i.e., multiple jobs, into consideration for job prioritization and resource allocation. Second, our design maintains a list of slots to preserve node contiguity information for resource allocation. Finally, we formulate the scheduling decision as a 0-1 Multiple Knapsack Problem and present two algorithms to solve it. A series of trace-based simulations using job logs collected from production supercomputers indicates that this new scheduling design has real potential and can effectively balance job performance and system performance.
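The 0-1 Multiple Knapsack formulation mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the paper's two algorithms, so the code below is not either of them: it is a brute-force enumeration that only works for tiny windows. The function name `schedule_window`, the choice of total allocated nodes as the objective, and the modeling of a slot as a single capacity number are all assumptions made for illustration.

```python
from itertools import product


def schedule_window(job_sizes, slot_sizes):
    """Exhaustively solve a tiny 0-1 multiple knapsack instance.

    job_sizes:  nodes requested by each job in the window (hypothetical input).
    slot_sizes: sizes of the contiguous free-node slots (hypothetical input).

    Returns (best_total, assignment), where assignment[i] is the slot index
    chosen for job i, or None if job i stays in the queue.
    Assumed objective: maximize the total number of allocated nodes.
    """
    # Each job is either left waiting (None) or placed in exactly one slot.
    choices = [None] + list(range(len(slot_sizes)))
    best_total = 0
    best_assignment = [None] * len(job_sizes)

    for assignment in product(choices, repeat=len(job_sizes)):
        used = [0] * len(slot_sizes)
        for size, slot in zip(job_sizes, assignment):
            if slot is not None:
                used[slot] += size
        # Discard assignments that overfill any slot.
        if any(u > cap for u, cap in zip(used, slot_sizes)):
            continue
        total = sum(used)
        if total > best_total:
            best_total = total
            best_assignment = list(assignment)

    return best_total, best_assignment


# A window of four jobs and two contiguous slots.
total, plan = schedule_window([512, 256, 256, 1024], [512, 768])
```

Enumeration costs (slots + 1) to the power of the window size, so it is usable only as a reference for checking a faster heuristic or dynamic-programming solver on small inputs.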