Generality is the Key Dimension in Accelerator Design
Jian Weng, Vidushi Dadu, Sihao Liu, Tony Nowatzki
2021-01-01
Abstract: Automating the hardware and software stack design of domain-specific accelerators can enable much broader applicability of efficient accelerator architectures. We take the position that what distinguishes domain-specific accelerators is their degree of generality along key dimensions (e.g., generality of control patterns, memory access, reuse, and parallelism). Generality is expensive in terms of hardware overhead, so accelerator designers carefully choose which dimensions to make general. However, automated accelerator design tools (e.g., high-level synthesis) typically focus their analysis on optimizing a single program region (allocating resources, executing operations in parallel, pipelining and orchestrating data, etc.). Generality, if it is needed, is left to the programmer to reason about in an awkward way. We argue that a new approach is needed, where generality is an integral and explicit aspect of automated accelerator design. This position raises difficult questions of how generality should be expressed in design exploration and how the hardware designer should convey the types of generality required. We discuss possible solutions based on our experiences with the DSAGEN accelerator design framework.

1 GENERALITY DEFINES ACCELERATORS

One key challenge in automated accelerator generation is designing for generality. In fact, we posit that the degrees of generality along various dimensions are the key distinguishing features of existing manually designed accelerators. Figure 1 overviews possible generality dimensions:

• Inst./Datatype: Breadth of compute units/datatypes.
• Control: The degree to which arbitrary forms of control flow are supported. For example, the ability to execute data-dependent control efficiently subsumes static control.
• Memory Access: How effectively arbitrary memory access patterns are supported. For example, indirect access can be viewed as a generalization of simpler affine access patterns.
• Reuse: The degree to which dynamic data reuse is supported. For example, this could mean the difference between the use of scratchpads and caches.
• Network: The flexibility in routing between hardware units. For example, the difference between a fixed network (e.g., a systolic array) and a reconfigurable network (e.g., a static CGRA network or a dynamically routed network-on-chip).
• Parallelism: The degree to which irregular parallelism can be executed efficiently. For example, support for task parallelism with programmable scheduling policies and load balancing.

[Figure 1: Dimensions of Generality in Accelerators. Each axis runs from less "general" to more "general". Axis labels: Access (linear, affine, indirect); Control (static, data-dep., unpredictable); Datatype/Inst. (single, flexible-width, multiple); Network (fixed, switched/reconfig., all-to-all); Reuse (static (SPAD), hierarchical, dynamic ($)); Parallelism (static, ordered, dynamic).]

LATTE ’21, April 15, 2021, Virtual, Earth. © 2021 Copyright held by the owner/author(s).

These dimensions of generality help explain the tradeoffs that accelerators make. Take the DianNao [2] accelerator for dense deep learning kernels: it provides just enough access generality to support the different patterns required for matrix multiplication, and just enough network/instruction generality to support fused non-linear transforms. An architecture like Chronos [1] is extremely flexible in its parallelism support, but uses fixed-function PEs. Q100 [6] has very efficient support for data-dependent control flow (joins/partitions), but has no support for general memory access (only contiguous).
GraphDynS [7] has general support for indirect memory access, remote atomics, and flexible load balancing, but supports only synchronous parallelism and relies on programmer-controlled scratchpads for exploiting reuse.

Further, the aspects of a design that are general are the primary source of its hardware cost. The task queue in Chronos consumes more resources than any processing element. The crossbar that enables indirect access in graph accelerators is a major source of area overhead (often the largest source [7]). The partitioners in Q100 account for roughly 50% of the area in most designs. In the era of specialization, generality has to be applied judiciously.
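To make the idea of treating generality as an explicit design parameter concrete, the following is a minimal sketch, not the DSAGEN implementation. It models three of the dimensions above as ordered levels, using only the orderings stated in the text (indirect generalizes affine access, data-dependent control subsumes static control, caches support more dynamic reuse than scratchpads). The names `Profile`, `covers`, and `least_general_cover`, the example designs, and the use of a summed level as a cost proxy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional


# Higher value = more general. Orderings follow the text's examples only.
class Access(IntEnum):
    AFFINE = 0
    INDIRECT = 1


class Control(IntEnum):
    STATIC = 0
    DATA_DEPENDENT = 1


class Reuse(IntEnum):
    SCRATCHPAD = 0
    CACHE = 1


@dataclass(frozen=True)
class Profile:
    """Generality of a design, or the generality a kernel requires."""
    access: Access
    control: Control
    reuse: Reuse

    def covers(self, need: "Profile") -> bool:
        # A design covers a kernel if it is at least as general
        # as the kernel's requirement along every dimension.
        return (self.access >= need.access
                and self.control >= need.control
                and self.reuse >= need.reuse)

    def generality(self) -> int:
        # Hypothetical cost proxy: the sum of levels. Real hardware cost
        # is not additive; this only encodes "more general costs more".
        return int(self.access) + int(self.control) + int(self.reuse)


def least_general_cover(designs: Dict[str, Profile],
                        kernels: List[Profile]) -> Optional[str]:
    """Name of the least general design that covers every kernel, or None."""
    feasible = [name for name, d in designs.items()
                if all(d.covers(k) for k in kernels)]
    if not feasible:
        return None
    return min(feasible, key=lambda name: designs[name].generality())


# Hypothetical candidate designs (not real accelerators).
designs = {
    "dense_only": Profile(Access.AFFINE, Control.STATIC, Reuse.SCRATCHPAD),
    "sparse_capable": Profile(Access.INDIRECT, Control.STATIC, Reuse.SCRATCHPAD),
    "fully_general": Profile(Access.INDIRECT, Control.DATA_DEPENDENT, Reuse.CACHE),
}

# Hypothetical workload: one dense kernel and one with indirect access.
kernels = [
    Profile(Access.AFFINE, Control.STATIC, Reuse.SCRATCHPAD),
    Profile(Access.INDIRECT, Control.STATIC, Reuse.SCRATCHPAD),
]

choice = least_general_cover(designs, kernels)
```

In this toy setup, the design with indirect access but static control and scratchpads is selected: it is the only candidate that covers both kernels without paying for generality that neither kernel needs.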