Abstract: Cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, as emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients with a single AI accelerator to improve its cost effectiveness. An ideal next-generation neural network accelerator should therefore support simultaneous execution of multiple neural networks while fully utilizing its hardware resources. However, existing accelerators, which are optimized for executing a single neural network, can suffer from severe resource underutilization when running multiple neural networks, mainly because of the load imbalance between computation and memory-access tasks from different neural networks. In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture that enables cost-effective, high-performance execution of multiple neural networks. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by matching compute-intensive and memory-intensive tasks from different networks and executing them in parallel. It is highly challenging, however, to find and schedule the best load-matching tasks from different neural networks at runtime without significantly increasing the size of on-chip memory. To overcome these challenges, AI-MT first creates fine-grained tasks at compile time by dividing each layer into multiple identical sub-layers. At runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.
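
The load-matching idea in the abstract can be illustrated with a toy model. The sketch below is not AI-MT's scheduler: it assumes a two-stage pipeline (one memory channel, one compute array), a fixed number of on-chip buffer slots, and a static round-robin interleaving in place of the dynamic prefetching, merging, and eviction the abstract describes. All names (`SubLayer`, `split_layer`, `makespan`) and the cycle counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubLayer:
    network: str
    mem_cycles: int   # cycles to load this sub-layer's memory block
    comp_cycles: int  # cycles to run this sub-layer's compute block


def split_layer(network, mem_cycles, comp_cycles, parts):
    """Divide one layer into `parts` identical sub-layers.

    Assumes both cycle counts are divisible by `parts`.
    """
    return [SubLayer(network, mem_cycles // parts, comp_cycles // parts)
            for _ in range(parts)]


def makespan(order, buffer_slots):
    """Total cycles to run `order` on the toy two-stage pipeline.

    A memory block can start loading only when the memory channel is idle
    and a buffer slot is free; a slot is freed when the compute block that
    used it finishes. Compute blocks run in order, one at a time.
    """
    mem_end = 0
    comp_ends = []
    for i, sub in enumerate(order):
        slot_free = comp_ends[i - buffer_slots] if i >= buffer_slots else 0
        mem_start = max(mem_end, slot_free)
        mem_end = mem_start + sub.mem_cycles
        prev_comp_end = comp_ends[-1] if comp_ends else 0
        comp_ends.append(max(prev_comp_end, mem_end) + sub.comp_cycles)
    return comp_ends[-1] if comp_ends else 0


# Hypothetical workloads: network A is compute-intensive, B is memory-intensive.
net_a = split_layer("A", mem_cycles=4, comp_cycles=16, parts=4)
net_b = split_layer("B", mem_cycles=16, comp_cycles=4, parts=4)

# Baseline: run all of A, then all of B.
sequential = net_a + net_b
# Load matching (simplified): alternate sub-layers from A and B.
interleaved = [sub for pair in zip(net_a, net_b) for sub in pair]

print("sequential :", makespan(sequential, buffer_slots=2))
print("interleaved:", makespan(interleaved, buffer_slots=2))
```

With two buffer slots, the interleaved order finishes in fewer cycles in this toy setting, because each memory-heavy load overlaps with a compute-heavy block from the other network. The numbers come from the made-up cycle counts above and say nothing about the speedups reported in the paper.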

Multi-clusters: an Efficient Design Paradigm of NN Accelerator Architecture Based on FPGA

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A Multi-Level-Optimization Framework for FPGA-Based Cellular Neural Network Implementation

DaDianNao: A Machine-Learning Supercomputer

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Design Exploration of Multi-FPGAs for Accelerating Deep Learning

DGNN-Booster: A Generic FPGA Accelerator Framework For Dynamic Graph Neural Network Inference

A Survey of FPGA-Based Neural Network Accelerator

Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

A Multi-Neural Network Acceleration Architecture

A Hybrid Heterogeneous Neural Network Accelerator Based on Systolic Array

A Fine-Grained Sparse Accelerator for Multi-Precision DNN

A high throughput acceleration for hybrid neural networks with efficient resource management on FPGA

Adaptive design and implementation of automatic modulation recognition accelerator

Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration

MuGRA: A Scalable Multi-Grained Reconfigurable Accelerator Powered by Elastic Neural Network

Design of FPGA-Based Accelerator for Convolutional Neural Network under Heterogeneous Computing Framework with OpenCL

MuDBN: An Energy-Efficient and High-Performance Multi-FPGA Accelerator for Deep Belief Networks

A neural network accelerated optimization method for FPGA