Abstract: In recent years, the number of clusters and cloud platforms dedicated to deep learning acceleration has grown, and research on multi-tenant deep learning (DL) cluster scheduling systems has advanced quickly. However, we observe several shortcomings in these systems. First, resources are heterogeneous, yet even the most advanced heterogeneity-aware schedulers distinguish resources only at the GPU-type level. Second, most scheduling systems balance efficiency and fairness poorly, which leads to unfair resource allocation and reduced user satisfaction. Third, we observe cluster fragmentation and job starvation. In this paper, we propose Hops, a new scheduling architecture with four components: (1) Fine-grained heterogeneity awareness with accurate throughput estimators, which extends heterogeneity awareness to the level of individual server entities. (2) Resource allocation through prior weighted integer linear programming (ILP) for specific placement locations, which balances fairness and efficiency. (3) "Latency ratio fairness" (LRF) as a user fairness criterion, which helps reduce starvation and improves user experience. (4) Deliberate use of low-sensitivity jobs to fill fragments, which addresses cluster fragmentation. In physical experiments, compared with the state-of-the-art scheduling architectures Sia [17] and Gavel [32], Hops reduces cluster completion time by 18.5% to 34.2%, shortens average job completion time (JCT) by 27.4% to 45.9%, lowers waiting latency by 35.4% to 54.9%, significantly reduces cluster fragmentation, and performs significantly better on fairness metrics. In 512-GPU simulation experiments, Hops not only improves system efficiency but also reduces the maximum job latency ratio by over 21× and decreases cluster fragmentation to less than 1 GPU per round on average.

Hop: Heterogeneity-Aware Decentralized Training

Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Heterogeneity-Aware Distributed Parameter Servers

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Hops: Fine-grained Heterogeneous Sensing, Efficient and Fair Deep Learning Cluster Scheduling System

SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Task allocation for decentralized training in heterogeneous environment

Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Heterogeneity-Aware Gradient Coding for Tolerating and Leveraging Stragglers

Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems

Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs over Heterogeneous Infrastructure

Position: Exploring the Robustness of Pipeline-Parallelism-Based Decentralized Training

FedGSync: Jointly Optimized Weak Synchronization and Gradient Transmission for Fast Distributed Machine Learning in Heterogeneous WAN

RCD-SGD: Resource-Constrained Distributed SGD in Heterogeneous Environment via Submodular Partitioning

HPSGD: Hierarchical Parallel SGD with Stale Gradients Featuring

Decentralized Training of Foundation Models in Heterogeneous Environments

Adaptive Configuration for Heterogeneous Participants in Decentralized Federated Learning

Communication compression for decentralized training

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices