Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
2024-09-13
Abstract: Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several magnitudes faster than an RTL simulation.
Performance, Artificial Intelligence, Hardware Architecture, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of efficiently designing and evaluating deep neural network (DNN) accelerators for resource-constrained edge devices. Specifically, the authors propose an automated method for generating fast performance models that quickly and accurately estimate the latency of a DNN on a specific accelerator architecture.

### Main Issues

1. **DNN Implementation on Resource-Constrained Devices**:
   - Implementing DNNs on edge devices is limited by available resources and therefore requires customized hardware accelerator architectures.
   - Designers need a clear understanding of how these accelerators perform when executing the intended AI workload.
2. **Challenges in Performance Evaluation**:
   - Selecting the most suitable accelerator architecture for a specific application is difficult because reliable performance metrics are usually available only through time-consuming simulators or through datasheets, which provide peak OPS/s for only a few selected DNNs.
   - In the early stages of accelerator design, different variants, including changes to both hardware and software, must be compared quickly.

### Solution

The authors propose an automated performance evaluation method that quickly and accurately models and evaluates DNN accelerators at different abstraction levels of hardware and software. The main contributions are:

1. **Abstract Computer Architecture Description Language (ACADL)**:
   - ACADL allows modeling and evaluating a wide range of accelerator architectures with various architectural parameters, at different levels of abstraction.
2. **Automated Generation of the Architecture Instruction Dependency Graph (AIDG)**:
   - Given an accelerator architecture described in ACADL and a DNN mapping, the AIDG is generated automatically.
   - The AIDG is a novel representation that covers hardware and software together, flexibly and abstractly, whereas other performance evaluation methods mainly focus on software (e.g., instruction set simulators) or on hardware (e.g., register-transfer level simulators).
3. **Fast Evaluation of the AIDG**:
   - The AIDG is evaluated by analyzing only a minimal portion of all loop kernel iterations (e.g., 0.0001%). The resulting estimates closely track RTL simulation results and outperform regression and analytical models reported in the literature.

### Method Overview

1. **ACADL Modeling**:
   - ACADL is used to model various parameterizable accelerator architectures, capturing the propagation of data and instructions at different levels of abstraction.
2. **DNN Mapping**:
   - The given DNN is mapped onto the accelerator architecture, which yields loop kernel instructions and their iteration counts.
3. **AIDG Construction**:
   - Each loop kernel instruction is propagated through the ACADL object graph to construct the AIDG, which captures the structural and data dependencies between instructions occupying hardware modules.
4. **AIDG Evaluation**:
   - Using the generated AIDG and the computed loop kernel iteration counts, the end-to-end latency of an entire DNN layer is estimated by executing only a few iterations of the loop kernel. The first iterations are analyzed until the latency of a single iteration stabilizes; that stable latency is then multiplied by the number of remaining iterations and added to the latency already accumulated.

### Experimental Validation

The authors validate the generality and accuracy of the method by modeling four different accelerator architectures and evaluating three state-of-the-art DNNs optimized for edge devices. They compare the results with regression models, an improved Roofline model, and Timeloop, as reported in the literature.
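The extrapolation idea in the AIDG Evaluation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the real method evaluates the AIDG, whereas here the per-iteration latency is hidden behind a hypothetical callable `iteration_latency`, and the stability criterion (three consecutive iterations with identical latency) is an assumption made for illustration. The warm-up numbers in `toy_iteration_latency` are invented.

```python
from typing import Callable, List, Tuple


def estimate_layer_latency(
    iteration_latency: Callable[[int], int],
    total_iterations: int,
    stable_window: int = 3,  # assumed criterion: this many equal latencies in a row
) -> Tuple[int, int]:
    """Return (estimated_cycles, iterations_actually_evaluated)."""
    if total_iterations <= 0:
        return 0, 0

    elapsed = 0              # cycles accumulated over the evaluated iterations
    history: List[int] = []  # per-iteration latencies seen so far

    for i in range(total_iterations):
        cycles = iteration_latency(i)
        elapsed += cycles
        history.append(cycles)

        # Once the last `stable_window` iterations took the same time,
        # treat that value as the steady-state latency and extrapolate.
        recent = history[-stable_window:]
        if len(recent) == stable_window and len(set(recent)) == 1:
            remaining = total_iterations - (i + 1)
            return elapsed + cycles * remaining, i + 1

    # Never stabilized: every iteration was evaluated, so the sum is exact.
    return elapsed, total_iterations


def toy_iteration_latency(i: int) -> int:
    """Hypothetical loop kernel: two warm-up iterations, then 7 cycles each."""
    warm_up = [12, 9]
    return warm_up[i] if i < len(warm_up) else 7


if __name__ == "__main__":
    estimate, evaluated = estimate_layer_latency(toy_iteration_latency, 1_000_000)
    print(f"estimated cycles: {estimate}, iterations evaluated: {evaluated}")
```

In this toy setting only five of the one million iterations are evaluated, and the estimate equals the exact sum because the latency really is constant after warm-up. If steady-state latency were to drift, a stricter or adaptive stability check would be needed.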
In summary, the paper addresses the challenges of designing and evaluating DNN accelerators for resource-constrained edge devices by proposing an automated method for generating fast and accurate performance models.
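For reference, the mean absolute percentage error (MAPE) named in the abstract is conventionally defined as shown below. The symbols are notation assumed here, not taken from the paper: $n$ is the number of evaluated workloads (for example, DNN layers), $t_i^{\text{sim}}$ is the latency reported by the reference simulation, and $t_i^{\text{est}}$ is the latency predicted by the performance model.

$$
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{t_i^{\text{sim}} - t_i^{\text{est}}}{t_i^{\text{sim}}} \right|
$$

A lower MAPE means the model's estimates lie closer to the simulation results.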