Abstract: Automating the hardware and software stack design of domain-specific accelerators can enable a much broader applicability of efficient accelerator architectures. We take the position that what distinguishes domain-specific accelerators is their degree of generality along key dimensions (e.g., generality of control patterns, memory access, reuse, and parallelism). Generality is expensive in terms of hardware overhead, so accelerator designers carefully choose along which dimensions to be general. However, automated accelerator design tools (e.g., high-level synthesis) typically focus their analysis on optimizing a single program region (allocating resources, executing operations in parallel, pipelining and orchestrating data, etc.). Generality, if it is needed, is left to the programmer to reason about in an awkward way. We argue that a new approach is needed, where generality is an integral and explicit aspect of automated accelerator design. This position raises difficult questions of how generality should be expressed in design exploration and how the hardware designer should convey the types of generality required. We discuss possible solutions based on our experiences with the DSAGEN accelerator design framework.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). LATTE ’21, April 15, 2021, Virtual, Earth © 2021 Copyright held by the owner/author(s).

LATTE ’21, April 15, 2021, Virtual, Earth Nowatzki et al.

1 GENERALITY DEFINES ACCELERATORS

One key challenge in automated accelerator generation is designing for generality. In fact, we posit that the degrees of generality along various dimensions are the key distinguishing features of existing manually designed accelerators. Figure 1 overviews possible generality dimensions:

• Inst./Datatype: Breadth of compute units/datatypes.
• Control: The degree to which arbitrary forms of control flow are supported. For example, the ability to execute data-dependent control efficiently subsumes static control.
• Memory Access: How efficiently arbitrary memory access patterns are supported. For example, indirect access can be viewed as a generalization of simpler affine access patterns.
• Reuse: The degree to which dynamic data reuse is supported. For example, this could mean the difference between the use of scratchpads and caches.
• Network: The flexibility in routing between hardware units. For example, the difference between a fixed network (e.g., a systolic array) and a reconfigurable network (e.g., a static CGRA network or a dynamically routed Network on Chip).
• Parallelism: The degree to which irregular parallelism can be executed efficiently. For example, support for task parallelism with programmable scheduling policies and load balancing.

[Figure 1: Dimensions of Generality in Accelerators. Each dimension is drawn on an axis running from less "general" to more "general". Labels recoverable from the figure: Access (Linear, Affine, Indirect); Control (Static, Data-dep., Unpredictable); Datatype/Inst. (Single, Flexible-width, Multiple); Network (Fixed, Switched/Reconfig., All-to-all); Reuse (Static (SPAD), Hierarchical, Dynamic ($)); Parallelism (Static, Dynamic, Ordered).]

These dimensions of generality help explain the tradeoffs that accelerators make. Take the DianNao [2] accelerator for dense deep learning kernels: it provides just enough access generality to support the different patterns required for matrix multiplication, and just enough network/instruction generality to support fused non-linear transforms. An architecture like Chronos [1] is extremely flexible in its parallelism support, but uses fixed-function PEs. Q100 [6] has very efficient support for data-dependent control flow (joins/partitions), but has no support for general memory access (only contiguous access). GraphDynS [7] has general support for indirect memory access, remote atomics, and flexible load balancing, but supports only synchronous parallelism and programmer-controlled scratchpads for exploiting reuse.

Further, the aspects of a design that are general are the primary source of hardware cost. The task queue in Chronos costs more resources than any processing element. The crossbar that enables indirect access in graph accelerators is a major source of area overhead (often the largest source [7]). The partitioners in Q100 account for roughly 50% of the area in most designs. In the era of specialization, generality has to be applied judiciously.
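As a rough illustration of treating generality as an explicit, ordered property, the sketch below encodes two of the dimensions from Figure 1 as ordered levels and checks whether an accelerator's generality covers a workload's needs. It is not DSAGEN's representation: the `supports` function, the dictionary keys, the `q100_like` profile, and both workloads are hypothetical names chosen for this example, and the ordering of levels is our reading of the figure labels.

```python
from enum import IntEnum


class Access(IntEnum):
    """Memory-access generality, least to most general (assumed ordering)."""
    LINEAR = 0
    AFFINE = 1
    INDIRECT = 2


class Control(IntEnum):
    """Control-flow generality, least to most general (assumed ordering)."""
    STATIC = 0
    DATA_DEPENDENT = 1
    UNPREDICTABLE = 2


def supports(provided: dict, required: dict) -> bool:
    """Return True if, on every required dimension, the provided level is at
    least as general as the required one. A dimension missing from `provided`
    is treated as the least general level (0)."""
    return all(int(provided.get(dim, 0)) >= int(level)
               for dim, level in required.items())


# Hypothetical profile loosely following the text's description of Q100:
# efficient data-dependent control, but only contiguous memory access.
q100_like = {"access": Access.LINEAR, "control": Control.DATA_DEPENDENT}

# Hypothetical workloads, invented purely for illustration.
streaming_join = {"access": Access.LINEAR, "control": Control.DATA_DEPENDENT}
graph_traversal = {"access": Access.INDIRECT, "control": Control.DATA_DEPENDENT}

print(supports(q100_like, streaming_join))   # True
print(supports(q100_like, graph_traversal))  # False
```

Using ordered levels captures the idea that a more general mechanism subsumes a less general one (indirect access covers affine access), so a single comparison per dimension suffices. A real design tool would also need to attach a hardware cost to each level, which this sketch deliberately leaves out.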