Atomic Dataflow based Graph-Level Workload Orchestration for Scalable DNN Accelerators

Shixuan Zheng, Xianjue Zhang, Leibo Liu, Shaojun Wei, Shouyi Yin
DOI: https://doi.org/10.1109/HPCA53966.2022.00042
2022-01-01
Abstract: To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators have been proposed in recent years, featuring multiple tensor engines and distributed on-chip buffers. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data-reuse potential, which demands careful workload orchestration. Previous work on the DNN hardware mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied. However, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from reaching their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at a workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive space of atom orderings. Finally, to facilitate data locality and reduce expensive off-chip memory access, we present mapping and buffering strategies that efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in performance, hardware utilization, and energy consumption.
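As a rough illustration of the first step, the sketch below uses simulated annealing to pick how many equal atoms each layer is cut into so that atom sizes come out as uniform as possible. It is not the paper's algorithm: the cost function (coefficient of variation of atom sizes), the single-integer split per layer, and the names `atom_sizes`, `imbalance`, and `anneal_splits` are all assumptions made for this example, and the per-layer workloads are arbitrary numbers rather than measured data.

```python
import math
import random


def atom_sizes(workloads, splits):
    """Sizes of all atoms when layer i is cut into splits[i] equal pieces."""
    return [w / s for w, s in zip(workloads, splits) for _ in range(s)]


def imbalance(workloads, splits):
    """Coefficient of variation of atom sizes (0 means perfectly uniform).

    Hypothetical cost function chosen for this sketch only.
    """
    sizes = atom_sizes(workloads, splits)
    mean = sum(sizes) / len(sizes)
    var = sum((x - mean) ** 2 for x in sizes) / len(sizes)
    return math.sqrt(var) / mean


def anneal_splits(workloads, max_split=16, steps=5000, t0=1.0, alpha=0.999, seed=0):
    """Search per-layer split counts with simulated annealing."""
    rng = random.Random(seed)
    splits = [1] * len(workloads)          # start with one atom per layer
    cost = imbalance(workloads, splits)
    best, best_cost = splits[:], cost
    t = t0
    for _ in range(steps):
        # Neighbor move: change one layer's split count by +/- 1, clamped.
        i = rng.randrange(len(splits))
        cand = splits[:]
        cand[i] = min(max_split, max(1, cand[i] + rng.choice((-1, 1))))
        c = imbalance(workloads, cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if c <= cost or rng.random() < math.exp((cost - c) / t):
            splits, cost = cand, c
            if cost < best_cost:
                best, best_cost = splits[:], cost
        t *= alpha                          # geometric cooling schedule
    return best, best_cost


if __name__ == "__main__":
    layer_work = [100, 200, 400]            # illustrative per-layer workloads
    print(anneal_splits(layer_work))
```

For the toy input, a split of `[1, 2, 4]` would yield seven atoms of identical size, which is the kind of balanced granularity the abstract describes; a realistic version would also have to account for tensor shapes and PE-array dimensions, which this sketch ignores.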