Abstract: Cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, as emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients with a single AI accelerator to improve cost-effectiveness. An ideal next-generation neural network accelerator should therefore support simultaneous execution of multiple neural networks while fully utilizing its hardware resources. However, existing accelerators, which are optimized for single neural network execution, can suffer from severe resource underutilization when running multiple neural networks, mainly due to the load imbalance between computation and memory-access tasks from different neural networks. In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture that enables cost-effective, high-performance multi-neural network execution. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by matching compute- and memory-intensive tasks from different networks and executing them in parallel. However, it is highly challenging to find and schedule the best load-matching tasks from different neural networks at runtime without significantly increasing the size of on-chip memory. To overcome these challenges, AI-MT first creates fine-grained tasks at compile time by dividing each layer into multiple identical sub-layers. At runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.
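The load-matching idea can be illustrated with a toy model. The sketch below is not AI-MT's scheduler: the `SubLayer` type, the cycle counts, the two hypothetical networks, and the alternating `interleave` policy are all assumptions chosen for illustration, and on-chip memory capacity (and hence eviction) is ignored. It assumes each sub-layer has a memory block (weight fetch) and a compute block, with one memory channel and one compute unit, and compares running a compute-heavy and a memory-heavy network back to back against interleaving their sub-layers.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class SubLayer:
    name: str
    mem: int   # cycles to fetch this sub-layer's weights (memory block)
    comp: int  # cycles to compute on those weights (compute block)


def makespan(order):
    """Finish time for one memory channel and one compute unit.

    Memory blocks are fetched back to back in `order` (unbounded prefetch);
    each compute block starts once its weights have arrived and the
    compute unit is free.
    """
    mem_free = comp_free = 0
    for s in order:
        mem_free += s.mem                           # fetch completes
        comp_free = max(comp_free, mem_free) + s.comp  # compute completes
    return comp_free


def interleave(a, b):
    """Alternate sub-layers from two networks (a simple matching policy)."""
    out = []
    for x, y in zip_longest(a, b):
        if x is not None:
            out.append(x)
        if y is not None:
            out.append(y)
    return out


# Hypothetical workloads: a compute-bound and a memory-bound network.
conv_like = [SubLayer(f"A{i}", mem=1, comp=4) for i in range(4)]
fc_like = [SubLayer(f"B{i}", mem=4, comp=1) for i in range(4)]

serial = makespan(conv_like) + makespan(fc_like)      # one network at a time
matched = makespan(interleave(conv_like, fc_like))    # sub-layers interleaved
print(serial, matched)  # 34 21
```

In this toy setting, interleaving keeps both the memory channel and the compute unit busy almost continuously, whereas running each network alone leaves one of the two resources idle most of the time. The numbers are specific to the assumed workloads and say nothing about the speedups reported in the evaluation.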

Automatic Mapping of Heterogeneous DNN Models on Adaptive Multi-Accelerator Systems

MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture

Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator

Memory and Computation Coordinated Mapping of DNNs Onto Complex Heterogeneous SoC

Memory-Computing Decoupling: A DNN Multitasking Accelerator with Adaptive Data Arrangement

mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator

RealArch: A Real-Time Scheduler for Mapping Multi-Tenant DNNs on Multi-Core Accelerators

Aries: A DNN Inference Scheduling Framework for Multi-core Accelerators

A Multi-Neural Network Acceleration Architecture

Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-based Accelerators

HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks

CD-MSA: Cooperative and Deadline-Aware Scheduling for Efficient Multi-Tenancy on DNN Accelerators

CCASM: A Computation- and Communication-Aware Scheduling and Mapping Algorithm for NoC-Based DNN Accelerators

An Efficient Accelerator for Point-based and Voxel-based Point Cloud Neural Networks

Enabling Latency-Sensitive DNN Inference Via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge

NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators

A Heterogeneous Full-stack AI Platform for Performance Monitoring and Hardware-specific Optimizations

SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators

A 3D Hybrid Optical-Electrical NoC Using Novel Mapping Strategy Based DCNN Dataflow Acceleration

ARCO: Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design