DistMind: Efficient Resource Disaggregation for Deep Learning Workloads

Xin Jin, Zhihao Bai, Zhen Zhang, Yibo Zhu, Yinmin Zhong, Xuanzhe Liu
DOI: https://doi.org/10.1109/tnet.2024.3355010
2024-01-01
IEEE/ACM Transactions on Networking
Abstract: Deep learning (DL) systems suffer from low resource utilization due to 1) a monolithic server model that tightly couples compute and memory; and 2) limited sharing between different inference applications, and between inference and training, because of strict service level objectives (SLOs). To address this problem, we present DistMind, a disaggregated DL system that enables efficient multiplexing of DL applications with near-optimal resource utilization. DistMind decouples compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be independently provisioned. The key challenge is to dynamically allocate GPU resources to different applications based on their real-time demands while meeting strict SLOs. We tackle this challenge by exploiting high-speed 100 Gbps networks and designing three-stage pipelining, cache-aware load balancing, and DNN-aware sharding mechanisms based on the characteristics of DL workloads, achieving millisecond-scale application loading overhead and improving system efficiency. We have implemented a prototype of DistMind and integrated it with PyTorch. Experimental results on AWS EC2 show that DistMind achieves nearly 100% resource utilization and that, compared with NVIDIA MPS and Ray, it improves throughput by up to 279% and reduces inference latency by up to 94%.
telecommunications; computer science, theory & methods; engineering, electrical & electronic, hardware & architecture