Abstract: INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, and the multi-tenancy of INFaaS has not yet been explored. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task-optimized DNN accelerators through time-division multiplexing could lead to poor isolation and high performance loss for INFaaS. In addition, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Existing solutions are therefore far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. To solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing for both Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) accelerators on a single FPGA. Isolation is enabled by a two-level instruction dispatch module and a multi-core-based hardware resource pool. These designs provide isolated and runtime-programmable hardware resources, which in turn lead to performance isolation for multi-tenant sharing. To overcome the heavy re-compilation overhead, a tiling-based instruction frame package design and a two-stage static-dynamic compilation are proposed. Only the lightweight runtime information is re-compiled, with ∼1 ms overhead, thus guaranteeing the private cloud’s performance. Finally, extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.
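
As a rough illustration of the two-stage static-dynamic compilation idea summarized in the abstract, the sketch below separates an expensive offline step, which builds per-tile instruction frame packages, from a lightweight online step, which only maps those prebuilt packages onto the cores currently allocated to a tenant. Everything here is an assumption made for illustration and is not taken from the paper: the names `Frame`, `static_compile`, and `dynamic_compile`, the equal-size tiling, the cycle estimates, and the greedy least-loaded assignment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    """Hypothetical instruction frame package covering one tile of one layer."""
    layer: str
    tile: int
    est_cycles: int  # assumed latency estimate attached during the offline stage


def static_compile(layer_cycles: Dict[str, int], tiles_per_layer: int) -> List[Frame]:
    """Offline stage (assumed): split every layer into equal tiles, one frame per tile."""
    frames: List[Frame] = []
    for layer, cycles in layer_cycles.items():
        per_tile = -(-cycles // tiles_per_layer)  # ceiling division
        frames.extend(Frame(layer, t, per_tile) for t in range(tiles_per_layer))
    return frames


def dynamic_compile(frames: List[Frame], core_ids: List[int]) -> Dict[int, List[Frame]]:
    """Online stage (assumed): assign prebuilt frames to the allocated cores.

    Uses a greedy rule: each frame goes to the core with the smallest
    accumulated estimated load. No frame is regenerated here.
    """
    if not core_ids:
        raise ValueError("at least one core must be allocated")
    schedule: Dict[int, List[Frame]] = {c: [] for c in core_ids}
    load: Dict[int, int] = {c: 0 for c in core_ids}
    for frame in frames:
        target = min(core_ids, key=lambda c: load[c])
        schedule[target].append(frame)
        load[target] += frame.est_cycles
    return schedule


# Build frames once, then remap them as the tenant's core allocation changes.
frames = static_compile({"conv1": 800, "conv2": 400}, tiles_per_layer=4)
two_core_plan = dynamic_compile(frames, core_ids=[0, 1])
four_core_plan = dynamic_compile(frames, core_ids=[0, 1, 2, 3])
```

In this sketch, a change in the number of allocated cores only triggers another call to `dynamic_compile`; the frames produced by `static_compile` are reused unchanged, which is the property the abstract attributes to re-compiling only lightweight runtime information.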

Automating Cloud Deployment for Real-Time Online Foundation Model Inference

Automating Cloud Deployment for Deep Learning Inference of Real-time Online Services

Review of Inference Time Prediction Approaches of DNN: Emphasis on Service Robots with Cloud-Edge-device Architecture

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Online Learning for Orchestration of Inference in Multi-User End-Edge-Cloud Networks

Joint Foundation Model Caching and Inference of Generative AI Services for Edge Intelligence

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Deep Neural Network Hardware Deployment Optimization via Advanced Active Learning

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference

A Cloud-Edge Collaboration Framework for Cognitive Service

KAIROS: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

AutoScale: Optimizing Energy Efficiency of End-to-End Edge Inference under Stochastic Variance

A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud

Collaborative Cloud-Edge Service Cognition Framework for DNN Configuration toward Smart IIoT

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Efficient Deployment of Large Language Model Across Cloud-Device Systems

DNN Deployment, Task Offloading, and Resource Allocation for Joint Task Inference in IIoT

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

EosDNN: An Efficient Offloading Scheme for DNN Inference Acceleration in Local-Edge-Cloud Collaborative Environments

Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems