Abstract: Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced technology nodes, together with die sizes approaching the reticle limit, make such large dies hard to build. With recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerators and the absence of a system- and package-level co-design methodology make it difficult for the designer to find the optimum design point in terms of Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework to explore the vast design space of chiplet-based AI accelerators, encompassing resource allocation, placement, and packaging architecture. We analytically model the PPAC of the chiplet-based AI accelerator and integrate it into an OpenAI gym environment to evaluate the design points. We also explore non-RL-based optimization approaches and combine the two approaches to ensure the robustness of the optimizer. The optimizer-suggested design point achieves 1.52X throughput, 0.27X energy, and 0.01X die cost while incurring only 1.62X package cost of its monolithic counterpart at iso-area.
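To illustrate why yield pushes large designs toward chiplets, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-A * D0). This is not the paper's analytical model: the defect density, the total area, and the flat cost per mm² are assumed values chosen for illustration, and the sketch ignores package cost, die-to-die interface area, and test cost.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of good dies under the Poisson yield model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_cost_per_good_die(area_mm2, defects_per_cm2, cost_per_mm2=1.0):
    """Silicon cost of one good die: raw area cost divided by yield."""
    return area_mm2 * cost_per_mm2 / poisson_yield(area_mm2, defects_per_cm2)

def total_silicon_cost(total_area_mm2, num_dies, defects_per_cm2):
    """Cost of covering total_area_mm2 with num_dies equally sized good dies."""
    die_area = total_area_mm2 / num_dies
    return num_dies * silicon_cost_per_good_die(die_area, defects_per_cm2)

# Assumed values for illustration only: 800 mm^2 of logic, D0 = 0.2 defects/cm^2.
TOTAL_AREA_MM2 = 800.0
D0 = 0.2

monolithic = total_silicon_cost(TOTAL_AREA_MM2, 1, D0)
chiplets = total_silicon_cost(TOTAL_AREA_MM2, 8, D0)
print(f"monolithic: {monolithic:.1f}, 8 chiplets: {chiplets:.1f}, "
      f"ratio: {chiplets / monolithic:.2f}")
```

Because yield falls exponentially with die area, splitting the same silicon into smaller dies lowers the cost per good mm². The toy numbers here are not meant to reproduce the 0.01X die cost reported in the abstract.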

What problem does this paper attempt to address?

The paper addresses the problem of optimizing chiplet-based AI accelerator design with advanced packaging technologies. Modern AI workloads demand large silicon area, but designers face high manufacturing costs, yield limitations, and die sizes approaching the reticle limit. To overcome these obstacles, the study proposes Chiplet-Gym, a Reinforcement Learning (RL)-based design space exploration framework for chiplet-based AI accelerators that covers resource allocation, placement, and packaging architecture.

The main contributions of the paper are:

1. **Proposing a co-design methodology**: The methodology jointly considers resource allocation (such as the number of AI chiplets, memory capacity, and bandwidth), the partitioning and placement of chiplets, and the selection of packaging technologies and their attributes (such as bandwidth, bump density, cost, and complexity) to optimize the system-level power, performance, area, and cost (PPAC) of chiplet-based AI accelerators.
2. **Establishing an analytical model**: An analytical PPAC model for chiplet-based architectures is developed, enabling rapid assessment of AI accelerator design points under time and resource constraints.
3. **Optimizing design parameters**: The interdependencies between design space parameters are identified, and the optimization problem is formulated as a reinforcement learning problem. Non-RL-based optimization methods (such as simulated annealing) are also explored and combined with the RL methods to ensure the robustness of the optimizer.

The research team validated the performance improvements of the optimized design over state-of-the-art monolithic GPUs on MLPerf benchmarks, demonstrating the effectiveness and practicality of the Chiplet-Gym framework.
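The idea of wrapping an analytical cost model in a gym-style environment and searching it with a non-RL optimizer can be sketched as follows. Everything here is hypothetical: the design space, the `evaluate` scoring function and its coefficients, and the `ChipletEnv` class are illustrative stand-ins rather than the paper's model or API. The environment mimics the `reset`/`step` convention without importing the gym package, and the annealer proposes candidates uniformly at random.

```python
import math
import random

# Hypothetical design space: chiplet count and packaging option.
CHIPLET_COUNTS = [1, 2, 4, 8, 16]
PACKAGES = {  # name: (relative link bandwidth, relative package cost per chiplet)
    "organic": (1.0, 1.0),
    "interposer": (4.0, 3.0),
}

def evaluate(design):
    """Toy PPAC-style score (higher is better); all coefficients are made up."""
    n, pkg = design
    bandwidth, pkg_cost = PACKAGES[pkg]
    die_area_cm2 = 8.0 / n                          # 8 cm^2 of silicon split n ways
    die_cost = 8.0 / math.exp(-0.2 * die_area_cm2)  # total area / Poisson yield
    comm_penalty = (n - 1) / bandwidth              # more chiplets, more die-to-die traffic
    throughput = 100.0 / (1.0 + 0.05 * comm_penalty)
    return throughput - die_cost - pkg_cost * n

class ChipletEnv:
    """Gym-style interface (reset/step) with one design choice per episode."""

    def __init__(self):
        self.actions = [(n, p) for n in CHIPLET_COUNTS for p in PACKAGES]
        self.state = self.actions[0]

    def reset(self):
        self.state = self.actions[0]
        return self.state

    def step(self, action_index):
        self.state = self.actions[action_index]
        return self.state, evaluate(self.state), True, {}

def simulated_annealing(env, steps=500, t0=10.0, seed=0):
    """Search the environment's actions with a linearly cooled annealer."""
    rng = random.Random(seed)
    env.reset()
    current = rng.randrange(len(env.actions))
    _, current_reward, _, _ = env.step(current)
    best, best_reward = current, current_reward
    for i in range(steps):
        temperature = t0 * (1.0 - i / steps) + 1e-9
        candidate = rng.randrange(len(env.actions))
        _, reward, _, _ = env.step(candidate)
        delta = reward - current_reward
        # Always accept improvements; accept worse designs with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_reward = candidate, reward
        if current_reward > best_reward:
            best, best_reward = current, current_reward
    return env.actions[best], best_reward

env = ChipletEnv()
best_design, best_reward = simulated_annealing(env)
print(best_design, round(best_reward, 2))
```

An RL agent would interact with the same `reset`/`step` interface, which is what makes it straightforward to compare or combine RL and non-RL optimizers over one cost model.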

Chiplet-Gym: Optimizing Chiplet-based AI Accelerator Design with Reinforcement Learning

An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

Designing Efficient and High-performance AI Accelerators with Customized STT-MRAM

Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms

Deep Reinforcement Learning-Based Power Management for Chiplet-Based Multicore Systems

A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules

NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators

Hierarchical Reinforcement Learning for Chip-Macro Placement in Integrated Circuit

Chip Placement with Deep Reinforcement Learning

Analysis of the Designs and Applications of AI Chip

Hardware Accelerated Optimization of Deep Learning Model on Artificial Intelligence Chip

RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures

Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

A Scalable Multi-Chiplet Deep Learning Accelerator with Hub-Side 2.5D Heterogeneous Integration

Learned Hardware/Software Co-Design of Neural Accelerators

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Modular High-Performance Computing Using Chiplets

A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models

Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-based Accelerators