Abstract:Recent research has shown that large language models (LLMs) can utilize low-precision floating point (FP) quantization to deliver high efficiency while maintaining original model accuracy. In particular, recent works have shown the effectiveness of non-power-of-two precisions, such as FP6 and FP5, and diverse sensitivity to low-precision arithmetic of LLM layers, which motivates mixed precision arithmetic including non-power-of-two precisions in LLMs. Although low-precision algorithmically leads to low computational overheads, such benefits cannot be fully exploited due to hardware constraints that support a limited set of power-of-two precisions (e.g., FP8, 16, 32, and 64 in NVIDIA H100 Tensor Core). In addition, the hardware compute units are designed to support standard formats (e.g., E4M3 and E5M2 for FP8). Such practices require re-designing the hardware whenever new precision and format emerge, which leads to high hardware replacement costs to exploit the benefits of new precisions and formats. Therefore, in this paper, we propose a new accelerator architecture, FlexiBit, which efficiently supports FP and INT arithmetic in arbitrary precisions and formats. Unlike previous bit-serial designs, which also provide flexibility but at the cost of performance due to its bit-wise temporal processing nature, FlexiBit's architecture enables bit-parallel processing of any precision and format without compute unit underutilization. FlexiBit's new capability to exploit non-power of two precision and format led to 1.66x and 1.62x higher performance per area on GPT-3 in FP6 targeting a cloud-scale accelerator, compared to a Tensor Core-like architecture and a state-of-the-art bit-parallel flexible precision accelerator, BitFusion, respectively. Also, the bit-parallel nature of FlexiBit's architecture led to 3.9x higher performance/area compared to a state-of-the-art bit-serial architecture.

MiniFloats on RISC-V Cores: ISA Extensions with Mixed-Precision Short Dot Products

MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores

FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Designing an IEEE-compliant FPU that supports configurable precision for soft processors

A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks

A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference

An Eight-Core 1.44-GHz RISC-V Vector Processor in 16-nm FinFET

PPU: Design and Implementation of a Pipelined Full Posit Processing Unit

XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

A Low-Cost Floating-Point FMA Unit Supporting Package Operations for HPC-AI Applications

Exploring and Exploiting Runtime Reconfigurable Floating Point Precision in Scientific Computing: a Case Study for Solving PDEs

A near-threshold RISC-V core with DSP extensions for scalable IoT Endpoint Devices

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge