Abstract: Recent research has shown that large language models (LLMs) can use low-precision floating-point (FP) quantization to deliver high efficiency while maintaining the original model accuracy. In particular, recent works have shown the effectiveness of non-power-of-two precisions, such as FP6 and FP5, and the varying sensitivity of LLM layers to low-precision arithmetic, which motivates mixed-precision arithmetic that includes non-power-of-two precisions in LLMs. Although lower precision algorithmically reduces computational overhead, this benefit cannot be fully exploited because current hardware supports only a limited set of power-of-two precisions (e.g., FP8, FP16, FP32, and FP64 in the NVIDIA H100 Tensor Core). In addition, the hardware compute units are designed to support only standard formats (e.g., E4M3 and E5M2 for FP8). Such practices require re-designing the hardware whenever a new precision or format emerges, which leads to high hardware replacement costs to exploit the benefits of new precisions and formats. Therefore, in this paper, we propose a new accelerator architecture, FlexiBit, which efficiently supports FP and INT arithmetic in arbitrary precisions and formats. Unlike previous bit-serial designs, which also provide flexibility but at the cost of performance due to their bit-wise temporal processing, FlexiBit's architecture enables bit-parallel processing of any precision and format without compute unit underutilization. FlexiBit's new capability to exploit non-power-of-two precisions and formats led to 1.66x and 1.62x higher performance per area on GPT-3 in FP6 targeting a cloud-scale accelerator, compared to a Tensor Core-like architecture and a state-of-the-art bit-parallel flexible-precision accelerator, BitFusion, respectively. Also, the bit-parallel nature of FlexiBit's architecture led to 3.9x higher performance per area compared to a state-of-the-art bit-serial architecture.
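To make the notion of "arbitrary precisions and formats" concrete, the following minimal sketch decodes a bit pattern for any exponent/mantissa split, such as the E4M3 and E5M2 formats named above. It is an illustration only and not FlexiBit's hardware method. The function name `decode_minifloat` is hypothetical, and the sketch assumes IEEE-like conventions (bias of 2^(e-1) - 1, subnormals when the exponent field is zero) with no encodings reserved for infinity or NaN, which real formats may handle differently. The E3M2 split used for FP6 in the usage lines is likewise an assumption, since the abstract does not state which FP6 format is used.

```python
def decode_minifloat(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a (1 + exp_bits + man_bits)-bit pattern into a Python float.

    Assumptions (not taken from the paper): IEEE-like bias of
    2**(exp_bits - 1) - 1, subnormal numbers when the exponent field is 0,
    and no bit patterns reserved for infinity or NaN.
    """
    if exp_bits < 1 or man_bits < 0:
        raise ValueError("need exp_bits >= 1 and man_bits >= 0")
    total_bits = 1 + exp_bits + man_bits
    if not 0 <= bits < (1 << total_bits):
        raise ValueError(f"bit pattern does not fit in {total_bits} bits")

    # Split the pattern into sign, exponent, and mantissa fields.
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)

    bias = (1 << (exp_bits - 1)) - 1
    fraction = man_field / (1 << man_bits)

    if exp_field == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + fraction) * 2.0 ** (exp_field - bias)


# Usage: the same routine covers 8-bit and 6-bit formats.
one_e4m3 = decode_minifloat(0b0_0111_000, exp_bits=4, man_bits=3)
one_e5m2 = decode_minifloat(0b0_01111_00, exp_bits=5, man_bits=2)
neg_fp6 = decode_minifloat(0b1_011_10, exp_bits=3, man_bits=2)  # assumed E3M2 split
```

Because the field widths are parameters rather than constants, the same code path handles power-of-two and non-power-of-two widths alike, which is the software analogue of the flexibility the abstract describes for hardware.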

A High-Precision Flexible Symmetry-Aware Architecture for Element-Wise Activation Functions

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A Fine-Grained Mixed Precision DNN Accelerator Using a Two-Stage Big-Little Core RISC-V MCU

Designing DNNs for a trade-off between robustness and processing performance in embedded devices

A Fine-Grained Sparse Accelerator for Multi-Precision DNN

Efficient Neural Networks on the Edge with FPGAs by Optimizing an Adaptive Activation Function

A Low-Cost Reconfigurable Nonlinear Core for Embedded DNN Applications

HAO: Hardware-aware neural Architecture Optimization for Efficient Inference

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

An Efficient Asymmetric Nonlinear Activation Function for Deep Neural Networks

ONE-SA: Enabling Nonlinear Operations in Systolic Arrays for Efficient and Flexible Neural Network Inference

SymNMF-Net for The Symmetric NMF Problem

A DNN Optimization Framework with Unlabeled Data for Efficient and Accurate Reconfigurable Hardware Inference

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

Floating-Point Approximation Enabling Cost-Effective and High-Precision Digital Implementation of FitzHugh-Nagumo Neural Networks

Symmetry-Based Structured Matrices for Efficient Approximately Equivariant Networks

Design Space Exploration of Neural Network Activation Function Circuits

A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing

ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core

Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors