Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini
2024-06-21
Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83%), sparse-dense (42%), and sparse-sparse (49%) matrix multiply.
Hardware Architecture
What problem does this paper attempt to address?
This paper introduces Occamy, a high-performance computing accelerator that targets the inefficiency of sparse linear algebra and stencil computations on modern CPUs and GPUs. Because of their sparsity and irregular memory access patterns, these workloads typically leave floating-point units (FPUs) poorly utilized, usually below 10%. Occamy improves their performance and energy efficiency through three main innovations:

1. **Efficient multi-precision compute cores**: Occamy's cores include SIMD (Single Instruction, Multiple Data) floating-point units that support 8- to 64-bit floating-point data, together with sparse stream units (SUs) that perform indirection, intersection, and union operations in hardware, accelerating general-purpose sparse computations.
2. **Scalable, latency-tolerant hierarchical architecture**: The system uses separate data and control interconnects and distributed DMA units to handle on-chip and off-chip data transfers flexibly and efficiently.
3. **2.5D packaging integration**: Occamy combines two compute chiplets and two 16 GiB HBM2E stacks (32 GiB in total) in a single 2.5D package, which improves the overall performance and energy efficiency of the system.

With these innovations, Occamy performs well across multiple benchmarks. On FP64 stencil codes and sparse-dense matrix multiplication, it achieves speedups of up to 3.9× and 4.6×, with FPU utilizations of 83% and 42%, respectively. On sparse-sparse matrix multiplication, it reaches 187 GCOMP/s at an FPU utilization of 49%. These results indicate that Occamy is well suited to workloads with sparse data and irregular memory accesses.
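To make the three stream operations named above concrete, here is a minimal software sketch in Python of what indirection, intersection, and union compute on sparse vectors stored as sorted index/value lists. It is an illustrative model only: the function names (`indirect_dot`, `intersect_dot`, `union_add`) and the data layout are assumptions for this example, not Occamy's hardware interface or the authors' code.

```python
# Illustrative software model of the three sparse stream operations.
# All names and the (values, sorted indices) layout are hypothetical.

def indirect_dot(vals, idcs, dense):
    """Sparse-dense dot product: gather dense[idx] for each stored index."""
    return sum(v * dense[i] for v, i in zip(vals, idcs))

def intersect_dot(a_vals, a_idcs, b_vals, b_idcs):
    """Sparse-sparse dot product: multiply only where indices match."""
    i = j = 0
    acc = 0.0
    while i < len(a_idcs) and j < len(b_idcs):
        if a_idcs[i] == b_idcs[j]:
            acc += a_vals[i] * b_vals[j]
            i += 1
            j += 1
        elif a_idcs[i] < b_idcs[j]:
            i += 1
        else:
            j += 1
    return acc

def union_add(a_vals, a_idcs, b_vals, b_idcs):
    """Sparse-sparse add: the result holds the union of both index sets."""
    i = j = 0
    out_vals, out_idcs = [], []
    while i < len(a_idcs) or j < len(b_idcs):
        if j == len(b_idcs) or (i < len(a_idcs) and a_idcs[i] < b_idcs[j]):
            out_idcs.append(a_idcs[i])      # index present only in a
            out_vals.append(a_vals[i])
            i += 1
        elif i == len(a_idcs) or b_idcs[j] < a_idcs[i]:
            out_idcs.append(b_idcs[j])      # index present only in b
            out_vals.append(b_vals[j])
            j += 1
        else:
            out_idcs.append(a_idcs[i])      # index present in both: add
            out_vals.append(a_vals[i] + b_vals[j])
            i += 1
            j += 1
    return out_vals, out_idcs

# Example sparse vectors (values, sorted indices) and a dense vector.
a_vals, a_idcs = [1.0, 2.0, 3.0], [0, 3, 5]
b_vals, b_idcs = [4.0, 5.0], [3, 4]
dense = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

print(indirect_dot(a_vals, a_idcs, dense))             # 270.0
print(intersect_dot(a_vals, a_idcs, b_vals, b_idcs))   # 8.0
print(union_add(a_vals, a_idcs, b_vals, b_idcs))       # ([1.0, 6.0, 5.0, 3.0], [0, 3, 4, 5])
```

In software, the index comparisons and data-dependent branches in these loops compete with the floating-point work for issue slots, which is one plausible reason such kernels leave FPUs underutilized on general-purpose cores. Moving the index matching into dedicated stream hardware is the idea the summary attributes to the SUs.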