An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng
2023-08-23
Abstract: Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines backend power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.
Machine Learning, Hardware Architecture
What problem does this paper attempt to address?
This paper addresses two key problems in the design space exploration (DSE) of machine learning (ML) accelerators:

1. **Generate optimized ML accelerators**: Design an ML accelerator that optimizes power, performance, and area (PPA).
2. **Select optimized hardware configurations**: Within a given range of architectural and backend parameters, select the hardware configuration that optimizes system-level metrics such as runtime and energy consumption.

Specifically, the paper proposes a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. The framework combines backend PPA analysis with frontend performance simulation, thereby achieving accurate estimation of backend PPA and of system metrics such as runtime and energy consumption. It also includes a fully automated DSE technique that optimizes backend and system metrics by automatically searching over architectural and backend parameters.

### Main contributions

1. **Full-stack optimization framework**: Proposes an ML-based full-stack optimization framework that covers the key design components of the software-hardware stack: the target ML algorithm, architectural parameters, RTL generation, SP&R (synthesis, placement, and routing) recipes for hardware implementation, performance simulation, and design space exploration. To the best of the authors' knowledge, this is the first time that the backend SP&R recipe has been integrated into a framework for optimizing ML accelerators.
2. **Chip-area prediction**: Extends previous work by adding prediction of ML accelerator chip area and by using the target layout utilization as a tunable backend feature, so that backend PPA and system-level performance can be predicted at different target layout utilizations.
3. **Sampling-method study**: Compares three sampling methods across different sample sizes: Latin hypercube sampling (LHS) and two low-discrepancy sequence (LDS) methods, based on Sobol and Halton sequences. Choosing an efficient sampling method and an appropriate sample size is crucial for building an accurate prediction model.
4. **Automatic optimization method**: Introduces a physical-design-driven method based on the multi-objective tree-structured Parzen estimator (MOTPE) that automatically optimizes ML accelerators for a given target ML algorithm and set of metrics. Experimental results show that this method can reduce the implementation time for optimizing ML accelerators from several months to a few days.
5. **Modular method**: Exploits the highly modular structure of ML accelerators to generate a logical-level graph in which each leaf node represents a building block of the accelerator. A graph convolutional network (GCN) extracts graph embeddings from this graph and is used to train the model. Experimental results show that, even with less data, the GCN model can match or exceed the prediction performance of other models on the test data set.

### Solutions

To achieve these goals, the paper adopts the following methods:

- **Data generation and model training**: Sample points are generated with the different sampling methods; each sample point corresponds to a specific architectural configuration and backend configuration. The generated RTL netlists go through SP&R and system-level simulation to capture the backend PPA and system-level metrics used for model training and testing.
- **Multi-objective optimization**: The trained model and the MOTPE method are used to optimize energy and chip area subject to power and runtime constraints, find the Pareto-optimal front, and select the best configuration according to a cost function.

In summary, the paper addresses key challenges in ML accelerator design by proposing a full-stack optimization framework that improves design efficiency and performance.
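The flow described under Solutions (sample configurations, predict metrics, apply constraints, extract the Pareto front, pick a configuration by cost) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the design space (`PE_DIMS`, `UTIL_RANGE`), the analytic `toy_predict` function standing in for the trained predictor, and the weighted-sum cost are all hypothetical, and the sketch scores every Halton sample exhaustively instead of running an MOTPE search. Energy and area are both treated as quantities to minimize.

```python
from typing import Dict, List, Optional, Sequence, Tuple

PE_DIMS = [8, 16, 32, 64]   # hypothetical architectural parameter choices
UTIL_RANGE = (0.4, 0.8)     # hypothetical target layout utilization range

Objs = Tuple[float, float]               # (energy, area), both minimized
Cand = Tuple[Objs, Tuple[int, float]]    # (objectives, (pe_dim, util))


def halton(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: one Halton coordinate in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def sample_configs(n: int) -> List[Tuple[int, float]]:
    """Map the first n 2-D Halton points (bases 2 and 3) onto the toy design space."""
    configs = []
    for i in range(1, n + 1):
        u, v = halton(i, 2), halton(i, 3)
        pe_dim = PE_DIMS[int(u * len(PE_DIMS))]
        util = UTIL_RANGE[0] + v * (UTIL_RANGE[1] - UTIL_RANGE[0])
        configs.append((pe_dim, util))
    return configs


def toy_predict(pe_dim: int, util: float) -> Dict[str, float]:
    """Made-up analytic stand-in for a trained predictor (arbitrary units)."""
    cells = pe_dim * pe_dim
    power = 0.5 * cells * (0.5 + util)   # assumed: denser layout draws more power
    runtime = 4096.0 / cells + 2.0       # assumed: more PEs shorten runtime
    area = 0.002 * cells / util          # assumed: lower utilization enlarges the die
    return {"power": power, "runtime": runtime, "area": area,
            "energy": power * runtime}


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Objs]) -> List[Objs]:
    """Keep the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def explore(n: int, max_power: float, max_runtime: float,
            w_energy: float = 1.0, w_area: float = 1.0
            ) -> Tuple[List[Cand], Optional[Cand]]:
    """Return the feasible Pareto front and the lowest-cost member (or None)."""
    feasible: List[Cand] = []
    for pe_dim, util in sample_configs(n):
        m = toy_predict(pe_dim, util)
        if m["power"] <= max_power and m["runtime"] <= max_runtime:
            feasible.append(((m["energy"], m["area"]), (pe_dim, util)))
    front_objs = pareto_front([objs for objs, _ in feasible])
    front = [c for c in feasible if c[0] in front_objs]
    # Hypothetical cost function: weighted sum of energy and area.
    best = min(front, key=lambda c: w_energy * c[0][0] + w_area * c[0][1],
               default=None)
    return front, best


if __name__ == "__main__":
    front, best = explore(32, max_power=1000.0, max_runtime=20.0)
    print(len(front), "Pareto-optimal configurations; selected:", best)
```

In a real flow the predictor would be trained on SP&R and simulation results, and a guided search such as MOTPE would propose new configurations rather than scoring a fixed sample set.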