Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms

Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang
2024-09-13
Abstract: Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing among multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4$\times$ and reduces the tail latency by up to 4.6$\times$, while improving the NPU utilization by 1.2$\times$ on average, compared to state-of-the-art NPU sharing approaches.
Hardware Architecture, Artificial Intelligence, Machine Learning, Operating Systems
What problem does this paper attempt to address?
The paper aims to address the issues of low resource utilization and resource sharing difficulties of Neural Processing Units (NPUs) in cloud computing platforms. Specifically:

1. **Resource Utilization Issue**: Currently, cloud computing platforms typically allocate an entire NPU chip to a single machine learning (ML) application instance. This practice leads to severe resource wastage because many deep neural network (DNN) inference workloads cannot fully utilize the matrix engines (MEs) and vector engines (VEs) on the NPU chip.
2. **Resource Sharing and Management Challenges**: To improve resource utilization and simplify resource management, it is necessary to virtualize hardware devices so that multiple tenants can share resources. However, modern cloud platforms lack system abstraction support, architectural support, and instruction set architecture (ISA) support for NPU virtualization, making it difficult to achieve fine-grained dynamic scheduling and resource allocation.

To address the above issues, the authors propose the Neu10 framework, a comprehensive NPU virtualization solution. Neu10 introduces flexible virtual NPU (vNPU) abstractions, new resource allocation mechanisms, and extended ISAs to support fine-grained multi-tenant workload scheduling, thereby improving NPU utilization and performance isolation. Experimental results show that compared to existing state-of-the-art NPU sharing methods, Neu10 significantly improves ML inference service throughput, reduces tail latency, and enhances NPU utilization.
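
To make the idea of mapping vNPUs onto pNPUs more concrete, here is a minimal Python sketch. It is not Neu10's allocator, whose policy is not described in this summary. It assumes a vNPU is just a requested number of matrix engines (MEs) and vector engines (VEs), and it uses a plain first-fit placement as a stand-in. The class names `VNPU` and `PNPU`, the function `allocate`, and all engine counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VNPU:
    """Hypothetical virtual NPU: a tenant's request for compute engines."""
    name: str
    mes: int  # matrix engines requested
    ves: int  # vector engines requested


@dataclass
class PNPU:
    """Hypothetical physical NPU with a fixed pool of MEs and VEs."""
    name: str
    total_mes: int
    total_ves: int
    tenants: List[VNPU] = field(default_factory=list)

    def free_mes(self) -> int:
        return self.total_mes - sum(v.mes for v in self.tenants)

    def free_ves(self) -> int:
        return self.total_ves - sum(v.ves for v in self.tenants)

    def fits(self, v: VNPU) -> bool:
        return v.mes <= self.free_mes() and v.ves <= self.free_ves()


def allocate(vnpus: List[VNPU], pnpus: List[PNPU]) -> Dict[str, Optional[str]]:
    """First-fit placement (an assumed stand-in policy, not the paper's).

    Each vNPU goes to the first pNPU with enough free MEs and VEs;
    a vNPU that fits nowhere is mapped to None.
    """
    mapping: Dict[str, Optional[str]] = {}
    for v in vnpus:
        mapping[v.name] = None
        for p in pnpus:
            if p.fits(v):
                p.tenants.append(v)
                mapping[v.name] = p.name
                break
    return mapping


# Illustrative numbers only: two pNPUs with 4 MEs and 4 VEs each.
pnpus = [PNPU("pnpu0", 4, 4), PNPU("pnpu1", 4, 4)]
vnpus = [VNPU("a", 2, 1), VNPU("b", 2, 2), VNPU("c", 1, 2), VNPU("d", 3, 1)]
mapping = allocate(vnpus, pnpus)
print(mapping)
for p in pnpus:
    print(p.name, "free MEs:", p.free_mes(), "free VEs:", p.free_ves())
```

In this toy run, four tenants share two chips instead of occupying four, which is the utilization argument the summary makes against whole-chip allocation. The sketch only covers static placement. It does not model the fine-grained dynamic operator scheduling or the ISA extension.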