Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini

2024-06-21

Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83%), sparse-dense (42%), and sparse-sparse (49%) matrix multiply.

Hardware Architecture

What problem does this paper attempt to address?

This paper introduces Occamy, a high-performance computing accelerator that targets the inefficiency of sparse linear algebra and stencil computations on modern CPUs and GPUs. Because of their sparsity and irregular memory access patterns, these workloads usually leave floating-point units (FPUs) badly underutilized, often below 10%. Occamy improves their performance and energy efficiency through three main innovations:

1. **Efficient multi-precision computing core**: Occamy is equipped with SIMD (Single Instruction, Multiple Data) floating-point units that support 8-to-64-bit floating-point data, and integrates sparse stream units (SUs) that can perform indirection, intersection, and union operations, thus accelerating general-purpose sparse computations.
2. **Scalable, latency-tolerant hierarchical architecture**: The system uses separate data and control interconnects and distributed DMA units to handle on-chip and off-chip data transfers flexibly and efficiently.
3. **2.5D packaging integration**: Occamy combines two compute chiplets and two 16-GB HBM2E stacks in a single 2.5D package, which improves the overall performance and energy efficiency of the system.

With these innovations, Occamy performs well across several benchmarks. On FP64 stencil codes and sparse-dense matrix multiplication, it achieves speedups of up to 3.9× and 4.6×, with FPU utilizations of 83% and 42%, respectively. On sparse-sparse matrix multiplication, it reaches 187 GCOMP/s at an FPU utilization of 49%. These results indicate that Occamy is well suited to workloads with sparse and irregular memory accesses.
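The three index-stream operations named in the first item (indirection, intersection, union) can be illustrated in software. The following is a minimal Python sketch of what each operation computes on sorted sparse vectors stored as index/value lists; it is not the hardware implementation, and the function names (`indirect_dot`, `intersect_dot`, `union_add`) and the list-based storage format are assumptions made for illustration only.

```python
# Software illustration only: function names and the (indices, values)
# list format are hypothetical, not taken from the paper.

def indirect_dot(idx, vals, dense):
    """Sparse-dense dot product: gather dense[idx[k]] by indirection."""
    return sum(v * dense[i] for i, v in zip(idx, vals))


def intersect_dot(a_idx, a_vals, b_idx, b_vals):
    """Sparse-sparse dot product: multiply only where indices match."""
    i = j = 0
    acc = 0.0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            acc += a_vals[i] * b_vals[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return acc


def union_add(a_idx, a_vals, b_idx, b_vals):
    """Sparse-sparse addition: merge sorted indices, summing on matches."""
    i = j = 0
    out_idx, out_vals = [], []
    while i < len(a_idx) or j < len(b_idx):
        # Take from b when a is exhausted or b's next index is smaller.
        if i == len(a_idx) or (j < len(b_idx) and b_idx[j] < a_idx[i]):
            out_idx.append(b_idx[j])
            out_vals.append(b_vals[j])
            j += 1
        # Take from a when b is exhausted or a's next index is smaller.
        elif j == len(b_idx) or a_idx[i] < b_idx[j]:
            out_idx.append(a_idx[i])
            out_vals.append(a_vals[i])
            i += 1
        # Indices match: emit one entry with the summed value.
        else:
            out_idx.append(a_idx[i])
            out_vals.append(a_vals[i] + b_vals[j])
            i += 1
            j += 1
    return out_idx, out_vals


a_idx, a_vals = [0, 2, 5], [1.0, 2.0, 3.0]
b_idx, b_vals = [2, 3, 5], [4.0, 5.0, 6.0]
dense = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

print(indirect_dot(a_idx, a_vals, dense))
print(intersect_dot(a_idx, a_vals, b_idx, b_vals))
print(union_add(a_idx, a_vals, b_idx, b_vals))
```

In all three loops, most of the work is index comparison and address generation rather than floating-point arithmetic. This is one plausible reading of why offloading such index handling to dedicated stream units could raise FPU utilization on sparse workloads.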

Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing

MiniFloats on RISC-V Cores: ISA Extensions with Mixed-Precision Short Dot Products

An Eight-Core 1.44-GHz RISC-V Vector Processor in 16-nm FinFET

22.1 A 12.4TOPS/W @ 136GOPS AI-IoT System-on-Chip with 16 RISC-V, 2-to-8b Precision-Scalable DNN Acceleration and 30%-Boost Adaptive Body Biasing

Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-Bit High Performance RISC-V Processor with Vector Extension : Industrial Product

Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters

A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

Culsans: An Efficient Snoop-based Coherency Unit for the CVA6 Open Source RISC-V application processor

MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores

ANDROMEDA: An FPGA Based RISC-V MPSoC Exploration Framework

Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing

Circular Reconfigurable Parallel Processor for Edge Computing : Industrial Product

IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management

A near-threshold RISC-V core with DSP extensions for scalable IoT Endpoint Devices

A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture