Abstract: To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators featuring multiple tensor engines and distributed on-chip buffers have been proposed in recent years. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data reuse potential, which demands careful workload orchestration. Previous work on the DNN hardware mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied. However, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from reaching their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at a workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive space of possible orderings. Finally, to facilitate data locality and reduce expensive off-chip memory access, we present mapping and buffering strategies to efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in terms of performance, hardware utilization, and energy consumption.
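To give a rough feel for how simulated annealing can be used to reduce load imbalance when cutting layers into atoms, the sketch below searches for a per-layer split count that makes atom sizes as uniform as possible. It is only an illustration under stated assumptions, not the algorithm from the paper: the cost function (coefficient of variation of atom sizes), the even per-layer split, the linear cooling schedule, and the names `imbalance` and `anneal_splits` are all hypothetical.

```python
import math
import random


def imbalance(layer_work, splits):
    """Coefficient of variation of atom sizes (assumed cost; 0 means uniform atoms)."""
    # Each layer i is cut evenly into splits[i] atoms of size layer_work[i] / splits[i].
    atoms = [w / s for w, s in zip(layer_work, splits) for _ in range(s)]
    mean = sum(atoms) / len(atoms)
    var = sum((a - mean) ** 2 for a in atoms) / len(atoms)
    return math.sqrt(var) / mean


def anneal_splits(layer_work, max_split=16, steps=5000, t0=1.0, seed=0):
    """Simulated annealing over per-layer split counts (illustrative sketch only)."""
    rng = random.Random(seed)
    current = [1] * len(layer_work)          # start with one atom per layer
    cur_cost = imbalance(layer_work, current)
    best, best_cost = list(current), cur_cost
    for step in range(steps):
        # Linear cooling; the small epsilon avoids division by zero.
        temp = t0 * (1.0 - step / steps) + 1e-9
        # Neighbor move: change one layer's split count by +/-1, clamped to [1, max_split].
        cand = list(current)
        i = rng.randrange(len(cand))
        cand[i] = min(max_split, max(1, cand[i] + rng.choice((-1, 1))))
        cost = imbalance(layer_work, cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = list(cand), cost
    return best, best_cost


if __name__ == "__main__":
    work = [100, 400, 200, 800]              # hypothetical per-layer workloads
    splits, cost = anneal_splits(work)
    print(splits, round(cost, 4))
```

A real atom generator would also have to account for tensor shapes, PE-array dimensions, and buffer capacity, none of which this toy cost function models.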

Atomic Dataflow based Graph-Level Workload Orchestration for Scalable DNN Accelerators
