Abstract: Cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, because emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients with a single AI accelerator to improve cost effectiveness. An ideal next-generation neural network accelerator should therefore support simultaneous execution of multiple neural networks while fully utilizing its hardware resources. However, existing accelerators, which are optimized for executing a single neural network, can suffer from severe resource underutilization when running multiple neural networks, mainly because of the load imbalance between computation and memory-access tasks from different neural networks. In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture that enables cost-effective, high-performance execution of multiple neural networks. The key idea of AI-MT is to fully utilize the accelerator's computation resources and memory bandwidth by matching compute-intensive and memory-intensive tasks from different networks and executing them in parallel. However, it is highly challenging to find and schedule the best load-matching tasks from different neural networks at runtime without significantly increasing the size of on-chip memory. To overcome these challenges, AI-MT first creates fine-grained tasks at compile time by dividing each layer into multiple identical sub-layers. At runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.
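
The load-matching idea in the abstract can be illustrated with a small toy model. The sketch below is not AI-MT's scheduler: it assumes one memory channel, one compute unit, a fixed number of on-chip buffer slots, and made-up per-sub-layer latencies, and it only compares a static one-network-after-another order against a static alternation of compute-heavy and memory-heavy sub-layers. The names `makespan`, `interleave`, `slots`, and all timing values are hypothetical and chosen for illustration only.

```python
def makespan(tasks, slots=2):
    """Finish time of a two-stage (load, then compute) pipeline.

    tasks: ordered list of (mem_time, comp_time) pairs, one per sub-layer.
    slots: assumed number of on-chip buffer slots; a slot stays occupied
           from the start of a sub-layer's load until its compute finishes.
    """
    mem_free = 0       # time at which the memory channel becomes idle
    comp_free = 0      # time at which the compute unit becomes idle
    comp_finish = []   # compute finish time of every scheduled sub-layer
    for i, (mem, comp) in enumerate(tasks):
        # A load may start only once a buffer slot has been released.
        slot_free = comp_finish[i - slots] if i >= slots else 0
        mem_start = max(mem_free, slot_free)
        mem_free = mem_start + mem
        # Compute may start only after this sub-layer's data is on chip.
        comp_start = max(comp_free, mem_free)
        comp_free = comp_start + comp
        comp_finish.append(comp_free)
    return comp_free


def interleave(a, b):
    """Alternate sub-layers of two networks (assumes equal lengths)."""
    out = []
    for x, y in zip(a, b):
        out.extend([x, y])
    return out


# Hypothetical workloads: four identical sub-layers per network.
compute_heavy = [(1, 4)] * 4   # short load, long compute
memory_heavy = [(4, 1)] * 4    # long load, short compute

sequential = compute_heavy + memory_heavy
interleaved = interleave(compute_heavy, memory_heavy)

print("sequential :", makespan(sequential))    # 30 time units
print("interleaved:", makespan(interleaved))   # 21 time units
```

In this toy setting the alternating order keeps both the memory channel and the compute unit busy with only two buffer slots, whereas the sequential order leaves one of them idle for long stretches. The numbers come from the assumed latencies and say nothing about the speedup reported in the abstract.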

A Multi-Mode Visual Recognition Hardware Accelerator for AR/MR Glasses

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

DaDianNao: A Machine-Learning Supercomputer

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks

A Multi-Task Hardwired Accelerator for Face Detection and Alignment

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

A Heterogeneous Architecture for the Vision Processing Unit with a Hybrid Deep Neural Network Accelerator

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing

A 182 mW 94.3 f/s in Full HD Pattern-Matching Based Image Recognition Accelerator for an Embedded Vision System in 0.13-μm CMOS Technology

A 3D Tiled Low Power Accelerator for Convolutional Neural Network

An Efficient General-Purpose Optical Accelerator for Neural Networks

PIXEL: Photonic Neural Network Accelerator

High-speed hardware accelerator based on brightness improved by Light-DehazeNet

Random resistive memory-based deep extreme point learning machine for unified visual processing

A 3D Multi-Layer CMOS-RRAM Accelerator for Neural Network

L-MPC: A LUT based Multi-Level Prediction-Correction Architecture for Accelerating Binary-Weight Hourglass Network

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

Photonic Neuromorphic Accelerator for Convolutional Neural Networks based on an Integrated Reconfigurable Mesh

MENAGE: Mixed-Signal Event-Driven Neuromorphic Accelerator for Edge Applications

A Multi-Neural Network Acceleration Architecture

A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity