ElasticRoom: Multi-Tenant DNN Inference Engine Via Co-design with Resource-constrained Compilation and Strong Priority Scheduling

Lixian Ma, Haoruo Chen, En Shao, Leping Wang, Quan Chen, Guangming Tan
DOI: https://doi.org/10.1145/3625549.3658654
2024-01-01
Abstract: GPU partition mechanisms in run-time software are widely used in job schedulers and multi-tenant computing systems to improve resource utilization and throughput. In systems that handle batched DNN inference, latency requirements differ across requests, for example between real-time and best-effort requests. However, existing GPU partition mechanisms and state-of-the-art scheduling approaches struggle to deliver both high throughput and low latency for real-time requests, because the partition mechanisms cannot improve GPU resource utilization and enforce job priority at the same time. In this paper, we present ElasticRoom, a multi-tenant DNN inference engine that co-designs resource-constrained compilation with strong priority scheduling to achieve high GPU utilization and low latency for real-time requests simultaneously. To remain portable across accelerator hardware from different manufacturers, ElasticRoom does not rely on any customized or preset features of the hardware or operating system. To quantify how well a DNN inference system processes a batch of real-time inference requests and meets their performance requirements within a valid time, we define Goodput for each batch of inference requests. We evaluated ElasticRoom on NVIDIA (A100) and AMD (MI100) GPUs, where it improved Goodput by 14% to 49% over well-established state-of-the-art methods.
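The abstract names Goodput but does not give its formula. As a rough illustration only, the sketch below assumes one common reading: the number of requests in a batch that finish within their latency budget, divided by the length of the measurement window. The `Request` fields, the `goodput` function, and the sample numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One inference request (hypothetical fields, times in milliseconds)."""
    arrival_ms: int
    finish_ms: int
    budget_ms: int  # assumed per-request latency budget


def goodput(requests, window_s):
    """Requests completed within their budget, per second of the window.

    This is an assumed definition for illustration, not the paper's.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    on_time = sum(
        1 for r in requests if r.finish_ms - r.arrival_ms <= r.budget_ms
    )
    return on_time / window_s


# Three requests with a 20 ms budget; the second one misses it.
batch = [
    Request(arrival_ms=0, finish_ms=10, budget_ms=20),
    Request(arrival_ms=0, finish_ms=30, budget_ms=20),
    Request(arrival_ms=5, finish_ms=20, budget_ms=20),
]
print(goodput(batch, window_s=0.5))  # 2 on-time requests / 0.5 s = 4.0
```

Under this reading, a scheduler that raises raw throughput while letting real-time requests miss their budgets would not raise Goodput, which is the distinction the abstract draws between throughput and latency guarantees.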