Abstract: The machine learning revolution places an unprecedented demand on computation resources, pushing for more transistors on a single monolithic chip, which is not sustainable in the post-Moore era. Multichip integration with small functional dies, called chiplets, can reduce manufacturing cost, improve fabrication yield, and enable die-level reuse across different system scales. DNN workload mapping and hardware design space exploration on such multichip systems are critical, but are still missing at the current stage. This work provides a hierarchical and analytical framework to describe DNN mapping on a multichip accelerator and to analyze the communication overhead. Based on this framework, we propose an automatic tool called NN-Baton with a pre-design flow and a post-design flow. The pre-design flow guides chiplet granularity exploration under given area and performance budgets for the target workload. The post-design flow focuses on workload orchestration across the computation levels of the hierarchy (package, chiplet, and core). Compared to Simba, NN-Baton generates mapping strategies that save 22.5% to 44% energy under the same computation and memory configurations. The architecture exploration demonstrates that area is a decisive factor for chiplet granularity. For a 2048-MAC system under a 2 mm² chiplet area constraint, the 4-chiplet implementation with 4 cores and 16 lanes of 8-size vector-MAC is always the top-pick computation allocation across several benchmarks. In contrast, the optimal memory allocation policy in the hierarchy typically depends on the neural network model.
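The computation allocation quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not part of NN-Baton: the function names and the set of candidate sizes are assumptions, and it reads the configuration as 4 chiplets, 4 cores per chiplet, 16 lanes per core, and 8 MACs per lane, an interpretation that the 2048-MAC total supports.

```python
from itertools import product


def total_macs(chiplets, cores, lanes, vector_size):
    """Total MAC units for a (chiplet, core, lane, vector-size) allocation."""
    return chiplets * cores * lanes * vector_size


def enumerate_allocations(target_macs, choices=(1, 2, 4, 8, 16, 32)):
    """List every allocation drawn from `choices` that hits the MAC budget.

    `choices` is a hypothetical set of candidate sizes for each level,
    not a range taken from the paper.
    """
    return [
        combo
        for combo in product(choices, repeat=4)
        if total_macs(*combo) == target_macs
    ]


# The top-pick allocation from the abstract: 4 x 4 x 16 x 8 = 2048 MACs.
assert total_macs(4, 4, 16, 8) == 2048

candidates = enumerate_allocations(2048)
print((4, 4, 16, 8) in candidates)
```

Enumerating the candidates this way only shows which allocations meet the MAC budget; choosing among them requires the area, energy, and communication models that the framework itself provides.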

A Scalable Multi-Chiplet Deep Learning Accelerator with Hub-Side 2.5D Heterogeneous Integration

DaDianNao: A Machine-Learning Supercomputer

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors

An Efficient Design Framework for 2×2 CNN Accelerator Chiplet Cluster with SerDes Interconnects

7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

A Scalable Network-on-Chip Microprocessor With 2.5D Integrated Memory and Accelerator

NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators

IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management

A Small-Footprint Accelerator for Large-Scale Neural Networks

Neural Synaptic Plasticity-Inspired Computing: A High Computing Efficient Deep Convolutional Neural Network Accelerator

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing

All-rounder: A flexible DNN accelerator with diverse data format support

A Scalable and Reconfigurable 2.5D Integrated Multicore Processor on Silicon Interposer

Modular High-Performance Computing Using Chiplets

Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for network

UIC: A Unified and Scalable Chip Integrating Neuromorphic Computation and General Purpose Processor

Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow

Multi-clusters: an Efficient Design Paradigm of NN Accelerator Architecture Based on FPGA

A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors