Abstract: INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, and multi-tenancy for INFaaS has not yet been explored. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task-optimized accelerators through time-division multiplexing can lead to poor isolation and high performance loss. In addition, current cloud-based DNN accelerators incur excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Existing solutions are therefore far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. To solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing for both Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) accelerators on a single FPGA. Isolation is enabled by a two-level instruction dispatch module and a multi-core-based hardware resource pool. These designs provide isolated, runtime-programmable hardware resources, which in turn give performance isolation for multi-tenant sharing. To overcome the heavy re-compilation overhead, we propose a tiling-based instruction frame package design and a two-stage static-dynamic compilation flow. Only the lightweight runtime information is re-compiled, with ∼1 ms overhead, thus guaranteeing the private cloud’s performance. Finally, extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.
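The two-stage static-dynamic compilation idea can be illustrated with a minimal Python sketch. This is not the authors' compiler: the names `FramePackage`, `static_compile`, and `dynamic_compile`, the cycle-cost field, and the round-robin tile assignment are all assumptions made for illustration. The sketch only shows the split the abstract describes: tiled instruction packages are generated once offline, and only a lightweight mapping of tiles to cores is redone when a tenant's core allocation changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FramePackage:
    """Instructions for one tile of one layer (hypothetical layout)."""
    layer: str
    tile: int
    cost: int  # assumed cycle estimate; a real package would hold instructions


def static_compile(layer_costs, tiles_per_layer):
    """Offline stage: split every layer into a fixed number of tiles, once per model."""
    packages = []
    for layer, cost in layer_costs.items():
        per_tile = cost // tiles_per_layer
        for t in range(tiles_per_layer):
            packages.append(FramePackage(layer, t, per_tile))
    return packages


def dynamic_compile(packages, num_cores):
    """Online stage: map the cached tiles onto the cores a tenant currently owns.

    Tiles of each layer are assigned round-robin (an assumed policy), so
    nothing from the static stage has to be regenerated.
    """
    schedule = {}
    for pkg in packages:
        per_core = schedule.setdefault(pkg.layer, [[] for _ in range(num_cores)])
        per_core[pkg.tile % num_cores].append(pkg.tile)
    return schedule


# Static stage runs once; the dynamic stage reruns whenever the allocation changes.
packages = static_compile({"conv1": 800, "conv2": 1600}, tiles_per_layer=4)
two_cores = dynamic_compile(packages, num_cores=2)
four_cores = dynamic_compile(packages, num_cores=4)
```

In this sketch, moving a tenant from two cores to four only reruns `dynamic_compile` over the cached packages, which is the kind of lightweight step the abstract credits with keeping reconfiguration overhead low.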

A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud