Abstract: Cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, because emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients with a single AI accelerator to improve cost-effectiveness. An ideal next-generation neural network accelerator should therefore support simultaneous execution of multiple neural networks while fully utilizing its hardware resources. However, existing accelerators, which are optimized for executing a single neural network, can suffer from severe resource underutilization when running multiple neural networks, mainly because of load imbalance between the computation and memory-access tasks of different neural networks. In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture that enables cost-effective, high-performance multi-neural network execution. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by matching compute- and memory-intensive tasks from different networks and executing them in parallel. However, it is highly challenging to find and schedule the best load-matching tasks from different neural networks at runtime without significantly increasing the size of on-chip memory. To overcome these challenges, AI-MT first creates fine-grained tasks at compile time by dividing each layer into multiple identical sub-layers. At runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.
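
The load-matching idea in the abstract can be illustrated with a toy cost model. The sketch below is not AI-MT's scheduler: the names `SubLayer`, `split_layer`, `serial_time`, and `matched_time` are hypothetical, the cycle counts are made up, and the model assumes that memory access and computation overlap perfectly and that sub-layers can be paired freely (dependencies, prefetching, merging, and eviction are ignored). It only shows why pairing a memory-intensive sub-layer with a compute-intensive one from another network can shorten total execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubLayer:
    net: str          # network the sub-layer belongs to
    mem_cycles: int   # cycles spent on memory access (assumed known)
    comp_cycles: int  # cycles spent on computation (assumed known)


def split_layer(net, mem_cycles, comp_cycles, parts):
    """Compile-time step: divide one layer into `parts` identical sub-layers.

    Assumes the cycle counts divide evenly by `parts`.
    """
    return [SubLayer(net, mem_cycles // parts, comp_cycles // parts)
            for _ in range(parts)]


def serial_time(tasks):
    """Toy baseline: run one sub-layer at a time.

    Each sub-layer is bound by whichever resource it uses longer.
    """
    return sum(max(t.mem_cycles, t.comp_cycles) for t in tasks)


def matched_time(tasks):
    """Toy load matching: pair the most memory-bound sub-layer with the
    most compute-bound one and run each pair in parallel.

    A pair is bound by the busier of the two shared resources.
    """
    ordered = sorted(tasks, key=lambda t: t.comp_cycles - t.mem_cycles)
    total = 0
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        a, b = ordered[lo], ordered[hi]
        total += max(a.mem_cycles + b.mem_cycles,
                     a.comp_cycles + b.comp_cycles)
        lo += 1
        hi -= 1
    if lo == hi:  # odd number of tasks: the middle one runs alone
        t = ordered[lo]
        total += max(t.mem_cycles, t.comp_cycles)
    return total


# Made-up workloads: a compute-heavy layer and a memory-heavy layer.
conv = split_layer("cnn", mem_cycles=4, comp_cycles=16, parts=4)
fc = split_layer("mlp", mem_cycles=12, comp_cycles=4, parts=4)
tasks = conv + fc

print("serial :", serial_time(tasks))
print("matched:", matched_time(tasks))
```

With these made-up numbers the serial estimate is 28 cycles and the matched estimate is 20 cycles. The gap comes purely from overlapping one network's memory traffic with another network's computation, and it says nothing about the speedups reported for AI-MT.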

Polymorphic Accelerators for Deep Neural Networks

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

Exploring Resource-Aware Deep Neural Network Accelerator and Architecture Design

Heterogeneous Multi-core Array-based DNN Accelerator

A Multi-Neural Network Acceleration Architecture

Software-defined Design Space Exploration for an Efficient DNN Accelerator Architecture

A Data-Driven Asynchronous Neural Network Accelerator

Nebula: A Scalable and Flexible Accelerator for DNN Multi-Branch Blocks on Embedded Systems

Multi-DNN Accelerators for Next-Generation AI Systems

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Field-Programmable Deep Neural Network (DNN) Learning and Inference accelerator: a concept

HybridDNN: A Framework for High-Performance Hybrid DNN Accelerator Design and Implementation

A Precision-Scalable Deep Neural Network Accelerator with Activation Sparsity Exploitation

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition

Energy-Efficient Accelerator Design for Deformable Convolution Networks

A Small-Footprint Accelerator for Large-Scale Neural Networks

INDM: Chiplet-Based Interconnect Network and Dataflow Mapping for DNN Accelerators

iFPNA: A Flexible and Efficient Deep Neural Network Accelerator with a Programmable Data Flow Engine in 28nm CMOS

An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses

A Formalism of DNN Accelerator Flexibility

FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices