Abstract: Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived architecture, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
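
The abstract reports accuracy as mean absolute percentage error (MAPE) against simulation results. As a minimal sketch of how that metric is computed, the Python snippet below uses the standard MAPE definition; the function name and the latency values are hypothetical placeholders, not numbers from the paper.

```python
from typing import Sequence


def mape(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute percentage error of estimates against references, in percent."""
    if len(estimated) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the relative deviation of each estimate from its reference value.
    relative_errors = [abs(e - r) / abs(r) for e, r in zip(estimated, reference)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical per-layer latencies in cycles (illustrative only).
estimated_cycles = [1050, 1980, 4100]   # e.g. from a performance model
simulated_cycles = [1000, 2000, 4000]   # e.g. from a reference simulation

error_percent = mape(estimated_cycles, simulated_cycles)
print(f"MAPE: {error_percent:.2f}%")
```

A lower MAPE means the model's latency estimates track the reference simulation more closely.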

What problem does this paper attempt to address?

The paper addresses the problem of efficiently designing and evaluating deep neural network (DNN) accelerators for resource-constrained edge devices. Specifically, the authors propose an automated method for generating fast performance models that quickly and accurately estimate the latency of DNNs on specific accelerator architectures.

### Main Issues

1. **DNN Implementation on Resource-Constrained Devices**:
   - Implementing DNNs on edge devices is limited by scarce resources and requires tailored hardware accelerator architectures.
   - A clear understanding of how these accelerators perform when executing the intended AI workload is needed.
2. **Challenges in Performance Evaluation**:
   - Selecting the most suitable accelerator architecture for a specific application is difficult because reliable performance metrics are usually available only from time-consuming simulators or from datasheets, which provide peak OPS/s for only a few selected DNNs.
   - In the early stages of accelerator design, different variants, including changes to both hardware and software, must be compared quickly.

### Solution

The authors propose an automated performance evaluation method that can quickly and accurately model and evaluate DNN accelerators at different abstraction levels of hardware and software. The main contributions include:

1. **Abstract Computer Architecture Description Language (ACADL)**:
   - Introducing ACADL, which allows modeling and evaluating a wide range of accelerator architectures with various architectural parameters at different levels of abstraction.
2. **Automated Generation of Architecture Instruction Dependency Graph (AIDG)**:
   - Given an accelerator architecture described with ACADL and a DNN mapping, automatically generating the AIDG.
     This is a novel approach that represents hardware and software flexibly and abstractly, whereas other performance evaluation methods focus mainly on software (e.g., instruction set simulators) or hardware (e.g., register-transfer level simulators).
3. **Fast Evaluation of AIDG**:
   - Proposing a method for quickly evaluating the AIDG that accurately estimates the performance of DNNs on accelerator architectures by analyzing only a minimal portion of all loop kernel iterations (e.g., 0.0001%), matching the accuracy of RTL simulation and outperforming regression and analytical models reported in the literature.

### Method Overview

1. **ACADL Modeling**:
   - Using ACADL to model various parameterizable accelerator architectures, capturing the propagation of data and instructions at different levels of abstraction.
2. **DNN Mapping**:
   - Mapping the given DNN to the accelerator architecture, generating loop kernel instructions and their iteration counts.
3. **AIDG Construction**:
   - Propagating each loop kernel instruction through the ACADL object graph to construct the AIDG, capturing structural and data dependencies between instructions occupying hardware modules.
4. **AIDG Evaluation**:
   - Using the generated AIDG and the computed loop kernel iteration counts, estimating the end-to-end latency of the entire DNN layer by executing only a few iterations of the loop kernel instructions. The first few iterations are analyzed until a stable single-iteration latency is established; multiplying that latency by the remaining loop kernel iteration count then yields the end-to-end latency of the entire DNN layer.

### Experimental Validation

The authors validated the generality and accuracy of the proposed method by modeling four different accelerator architectures and using three state-of-the-art DNNs optimized for edge devices, comparing it with regression models, an improved Roofline model, and Timeloop reported in the literature.

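
The AIDG evaluation step in the method overview (analyze iterations until the single-iteration latency stabilizes, then extrapolate over the remaining iterations) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: `estimate_layer_latency`, the `stable_window` criterion, and the toy latency function are all assumptions, and the callback merely stands in for evaluating one loop kernel iteration on the AIDG. The sketch also assumes the latency of the analyzed iterations is added to the extrapolated part.

```python
from typing import Callable, List, Tuple


def estimate_layer_latency(
    iteration_latency: Callable[[int], int],
    total_iterations: int,
    stable_window: int = 3,
) -> Tuple[int, int]:
    """Return (estimated total cycles, number of iterations actually evaluated).

    `iteration_latency(i)` is a hypothetical stand-in for evaluating loop
    kernel iteration `i` in detail and returning its latency in cycles.
    """
    elapsed = 0
    history: List[int] = []
    for i in range(total_iterations):
        cycles = iteration_latency(i)
        elapsed += cycles
        history.append(cycles)
        # Assumed stability criterion: the last `stable_window` iterations
        # all took the same number of cycles.
        recent = history[-stable_window:]
        if len(recent) == stable_window and len(set(recent)) == 1:
            remaining = total_iterations - (i + 1)
            # Extrapolate: the remaining iterations are assumed to take the
            # stable per-iteration latency.
            return elapsed + remaining * cycles, i + 1
    # Latency never stabilized, so every iteration was evaluated.
    return elapsed, total_iterations


def toy_iteration_latency(i: int) -> int:
    """Hypothetical latency trace: two warm-up iterations, then a steady state."""
    warmup = [12, 9]
    return warmup[i] if i < len(warmup) else 7


estimate, evaluated = estimate_layer_latency(toy_iteration_latency, 1_000_000)
print(f"estimated {estimate} cycles after evaluating {evaluated} iterations")
```

In this toy trace only five of one million iterations are evaluated in detail, which illustrates why the extrapolation can be far cheaper than evaluating every iteration.
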
In summary, the paper addresses the challenges of designing and evaluating DNN accelerators for resource-constrained edge devices by proposing an automated method for generating fast performance models.

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

Machine Learning-enabled Performance Model for DNN Applications and AI Accelerator

Model-Platform Optimized Deep Neural Network Accelerator Generation Through Mixed-Integer Geometric Programming.

Software-defined Design Space Exploration for an Efficient DNN Accelerator Architecture

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability.

Mitigating Edge Machine Learning Inference Bottlenecks: An Empirical Study on Accelerating Google Edge Models

Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks

Being-ahead: Benchmarking and Exploring Accelerators for Hardware-Efficient AI Deployment

A Small-Footprint Accelerator for Large-Scale Neural Networks

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

A Heterogeneous Full-stack AI Platform for Performance Monitoring and Hardware-specific Optimizations

Deep Learning Accelerators' Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

Polymorphic Accelerators for Deep Neural Networks

Hardware Accelerator Design for Sparse DNN Inference and Training: A Tutorial

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

Heterogeneous Multi-core Array-based DNN Accelerator