Abstract: To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators featuring multiple tensor engines and distributed on-chip buffers have been proposed in recent years. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data-reuse potential, which demands careful workload orchestration. Previous work on the DNN hardware mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied. However, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from reaching their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at a workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive space of possible atom orderings. Finally, to improve data locality and reduce expensive off-chip memory access, we present mapping and buffering strategies that efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in terms of performance, hardware utilization, and energy consumption.
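
To make the idea of annealing-based atom generation concrete, the following is a minimal Python sketch, not the algorithm from the paper. It assumes each layer is described only by a scalar amount of work and is cut into equal-sized atoms, and it searches for per-layer split counts that make atom costs as uniform as possible. The names `imbalance` and `anneal_splits`, the cost metric, and all parameters (`max_split`, `steps`, `t0`, `cooling`) are hypothetical choices made for illustration.

```python
import math
import random


def imbalance(layer_work, splits):
    """Relative spread of per-atom cost: (max - min) / mean.

    Assumption: a layer with work w cut into k atoms yields atoms of cost w / k.
    """
    atoms = [w / k for w, k in zip(layer_work, splits)]
    mean = sum(atoms) / len(atoms)
    return (max(atoms) - min(atoms)) / mean


def anneal_splits(layer_work, max_split=16, steps=2000,
                  t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over per-layer split counts (illustrative only)."""
    rng = random.Random(seed)          # seeded for reproducibility
    current = [1] * len(layer_work)    # start with one atom per layer
    cur_cost = imbalance(layer_work, current)
    best, best_cost = list(current), cur_cost
    temp = t0
    for _ in range(steps):
        # Neighbor move: nudge one layer's split count by +/-1, clamped to range.
        cand = list(current)
        i = rng.randrange(len(cand))
        cand[i] = min(max_split, max(1, cand[i] + rng.choice((-1, 1))))
        cand_cost = imbalance(layer_work, cand)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp) (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        temp *= cooling                # geometric cooling schedule
    return best, best_cost


if __name__ == "__main__":
    work = [64, 128, 256, 32]          # made-up per-layer workloads
    splits, cost = anneal_splits(work)
    print("splits:", splits, "imbalance:", round(cost, 4))
```

Because the search tracks the best state it has visited, the returned imbalance is never worse than that of the unsplit starting point. A real atom generator would also have to model tensor shapes, PE-array dimensions, and buffer capacity, none of which this sketch attempts.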

Atomic Dataflow based Graph-Level Workload Orchestration for Scalable DNN Accelerators
