A Scalable Multi-Chiplet Deep Learning Accelerator with Hub-Side 2.5D Heterogeneous Integration.

Zhanhong Tan, Yifu Wu, Yannian Zhang, Haobing Shi, Wuke Zhang, Kaisheng Ma
DOI: https://doi.org/10.1109/hcs59251.2023.10254703
2023-01-01
Abstract: With the slowdown of Moore's law, the scenario diversity of specialized computing, and the rapid development of application algorithms, an efficient chip design requires modularization, flexibility, and scalability. In this study, we propose a Chiplet-based deep learning accelerator prototype that contains one HUB Chiplet and six extended SIDE Chiplets integrated on an RDL layer for the 2.5D package. The SIDE and the HUB contain one and four AI cores, respectively. Given that our Chiplet system targets diverse scenarios via scalable connected SIDE Chiplets, we need to handle three challenges: a) devise a flexible architecture design supporting diverse shapes, b) search for a workload mapping with low die-to-die communication, and c) adopt a high-bandwidth die-to-die interface to maintain efficient data transfer. This study proposes a flexible neural core (FNC) featuring dynamic bit-width computing and flexible parallelism. Next, we use a hierarchy-based mapping scheme to decouple different parallelism levels and help analyze the communication. A 12Gbps D2D interface is introduced to achieve 192Gb/s bandwidth per D2D port with 1.04pJ/bit efficiency and 55um bump pitch. The proposed seven-Chiplet accelerator achieves a peak performance of 10/20/40 TOPS for INT16/8/4. When enabling 0~6 SIDE Chiplets, the system power ranges from 4.5W to 12W. The power efficiency of the FNC is 2.02TOPS/W while that of the overall system is 1.67TOPS/W.
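The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reading of the quoted numbers, not something the paper provides. It assumes that "12Gbps" is a per-lane rate (so the lane count per port is an inference), that the 1.04pJ/bit figure applies at full port bandwidth, and that power grows evenly with each enabled SIDE Chiplet. All variable names are hypothetical.

```python
# Figures quoted in the abstract.
LANE_RATE_GBPS = 12.0        # assumed to be the per-lane D2D data rate
PORT_BW_GBPS = 192.0         # bandwidth per D2D port
ENERGY_PJ_PER_BIT = 1.04     # D2D interface energy efficiency
PEAK_TOPS = {"INT16": 10.0, "INT8": 20.0, "INT4": 40.0}
POWER_MIN_W = 4.5            # system power with 0 SIDE Chiplets enabled
POWER_MAX_W = 12.0           # system power with 6 SIDE Chiplets enabled
NUM_SIDE = 6

# Implied lanes per port (an inference, not stated in the abstract).
lanes_per_port = PORT_BW_GBPS / LANE_RATE_GBPS

# Power of one D2D port if it runs at full bandwidth: bits/s * J/bit.
port_power_w = (PORT_BW_GBPS * 1e9) * (ENERGY_PJ_PER_BIT * 1e-12)

# Average power added per enabled SIDE Chiplet, assuming even scaling.
avg_power_per_side_w = (POWER_MAX_W - POWER_MIN_W) / NUM_SIDE

# Peak TOPS divided by maximum system power, per precision.
system_tops_per_w = {p: t / POWER_MAX_W for p, t in PEAK_TOPS.items()}

print(f"lanes per port:        {lanes_per_port:.0f}")
print(f"D2D port power (W):    {port_power_w:.3f}")
print(f"avg W per SIDE:        {avg_power_per_side_w:.2f}")
for precision, eff in system_tops_per_w.items():
    print(f"{precision:>5} TOPS/W at 12W: {eff:.2f}")
```

Under these assumptions, the INT8 peak divided by the 12W maximum gives roughly 1.67 TOPS/W, which matches the quoted system efficiency. The abstract does not say which precision that figure refers to, so this match is suggestive rather than confirmed.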