Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms

Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang

2024-09-13

Abstract: Cloud platforms today deploy hardware accelerators such as neural processing units (NPUs) to power machine learning (ML) inference services. To maximize resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs so that multi-tenant ML services can share them efficiently. However, virtualizing NPUs for modern cloud platforms is not easy. This is due not only to the lack of system abstraction support for NPU hardware, but also to the lack of architectural and ISA support for fine-grained dynamic operator scheduling on virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension to modern NPU architectures that facilitates fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4$\times$ and reduces the tail latency by up to 4.6$\times$, while improving the NPU utilization by 1.2$\times$ on average, compared to state-of-the-art NPU sharing approaches.

Hardware Architecture, Artificial Intelligence, Machine Learning, Operating Systems

What problem does this paper attempt to address?

The paper aims to address the low resource utilization and resource sharing difficulties of Neural Processing Units (NPUs) in cloud computing platforms. Specifically:

1. **Resource Utilization Issue**: Currently, cloud computing platforms typically allocate an entire NPU chip to a single machine learning (ML) application instance. This practice leads to severe resource wastage because many deep neural network (DNN) inference workloads cannot fully utilize the matrix engines (MEs) and vector engines (VEs) on the NPU chip.
2. **Resource Sharing and Management Challenges**: To improve resource utilization and simplify resource management, it is necessary to virtualize hardware devices so that multiple tenants can share resources. However, modern cloud platforms lack system abstraction support, architectural support, and instruction set architecture (ISA) support for NPU virtualization, making it difficult to achieve fine-grained dynamic scheduling and resource allocation.

To address these issues, the authors propose the Neu10 framework, a comprehensive NPU virtualization solution. Neu10 introduces a flexible virtual NPU (vNPU) abstraction, a new resource allocation mechanism, and an extended ISA to support fine-grained multi-tenant workload scheduling, thereby improving NPU utilization and performance isolation. Experimental results show that, compared to existing state-of-the-art NPU sharing methods, Neu10 improves ML inference service throughput, reduces tail latency, and increases NPU utilization.
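To make the idea of sharing one chip's matrix and vector engines among several tenants concrete, here is a minimal sketch in Python. It is not Neu10's allocator: the first-fit policy, the class names `VNPU` and `PNPU`, the function `allocate`, and all engine counts are assumptions chosen only for illustration. It shows how requests sized in MEs and VEs could be packed onto physical NPUs instead of giving each tenant a whole chip.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VNPU:
    """A virtual NPU request (hypothetical model): engines a tenant asks for."""
    name: str
    mes: int  # matrix engines requested
    ves: int  # vector engines requested


@dataclass
class PNPU:
    """A physical NPU (hypothetical model) with a fixed number of engines."""
    name: str
    total_mes: int
    total_ves: int
    tenants: List[VNPU] = field(default_factory=list)

    @property
    def free_mes(self) -> int:
        return self.total_mes - sum(v.mes for v in self.tenants)

    @property
    def free_ves(self) -> int:
        return self.total_ves - sum(v.ves for v in self.tenants)

    def fits(self, v: VNPU) -> bool:
        return v.mes <= self.free_mes and v.ves <= self.free_ves


def allocate(vnpus: List[VNPU], pnpus: List[PNPU]) -> Dict[str, Optional[str]]:
    """First-fit placement (an assumed policy, not the paper's):
    map each vNPU to the first pNPU with enough free MEs and VEs."""
    placement: Dict[str, Optional[str]] = {}
    for v in vnpus:
        placement[v.name] = None  # stays None if no pNPU can host it
        for p in pnpus:
            if p.fits(v):
                p.tenants.append(v)
                placement[v.name] = p.name
                break
    return placement


def me_allocation_ratio(pnpus: List[PNPU]) -> float:
    """Fraction of all matrix engines that are assigned to some vNPU."""
    total = sum(p.total_mes for p in pnpus)
    used = sum(p.total_mes - p.free_mes for p in pnpus)
    return used / total


if __name__ == "__main__":
    # Engine counts below are made up for the example.
    chips = [PNPU("pnpu0", 4, 4), PNPU("pnpu1", 4, 4)]
    requests = [VNPU("a", 2, 1), VNPU("b", 2, 2), VNPU("c", 3, 1)]
    print(allocate(requests, chips))
    print(me_allocation_ratio(chips))
```

In this toy setup, three tenants share two chips, whereas whole-chip allocation would need three. The sketch only covers static placement; the dynamic operator scheduling and ISA support described in the abstract are outside its scope.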
