Abstract: Recent research has shown that large language models (LLMs) can use low-precision floating point (FP) quantization to deliver high efficiency while maintaining original model accuracy. In particular, recent works have shown the effectiveness of non-power-of-two precisions, such as FP6 and FP5, and have found that LLM layers differ in their sensitivity to low-precision arithmetic, which motivates mixed-precision arithmetic that includes non-power-of-two precisions in LLMs. Although low precision algorithmically leads to low computational overhead, these benefits cannot be fully exploited because hardware supports only a limited set of power-of-two precisions (e.g., FP8, FP16, FP32, and FP64 in the NVIDIA H100 Tensor Core). In addition, hardware compute units are designed to support only standard formats (e.g., E4M3 and E5M2 for FP8). Such practices require re-designing the hardware whenever a new precision or format emerges, which leads to high hardware replacement costs to exploit the benefits of new precisions and formats. Therefore, in this paper, we propose a new accelerator architecture, FlexiBit, which efficiently supports FP and INT arithmetic in arbitrary precisions and formats. Unlike previous bit-serial designs, which also provide flexibility but at the cost of performance due to their bit-wise temporal processing nature, FlexiBit's architecture enables bit-parallel processing of any precision and format without compute unit underutilization. FlexiBit's new capability to exploit non-power-of-two precisions and formats leads to 1.66x and 1.62x higher performance per area on GPT-3 in FP6 targeting a cloud-scale accelerator, compared to a Tensor Core-like architecture and a state-of-the-art bit-parallel flexible-precision accelerator, BitFusion, respectively. Also, the bit-parallel nature of FlexiBit's architecture leads to 3.9x higher performance per area compared to a state-of-the-art bit-serial architecture.
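
To make "arbitrary precisions and formats" concrete, the following is a minimal software sketch of how a bit pattern can be decoded under an arbitrary ExMy split (for example FP6 as E3M2 or E2M3). It is not FlexiBit's hardware datapath. The function name `decode_fp` is hypothetical, and the sketch assumes an IEEE-style exponent bias of 2^(E-1) - 1, subnormals when the exponent field is zero, and no Inf/NaN encodings; standard FP8 formats such as E4M3 and E5M2 reserve some encodings for special values, which this sketch does not model.

```python
def decode_fp(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a sign/exponent/mantissa bit pattern into a Python float.

    Assumptions (illustrative, not taken from the paper):
      - layout is [sign | exponent | mantissa], 1 + exp_bits + man_bits wide
      - bias is 2**(exp_bits - 1) - 1 (IEEE-style)
      - exponent field 0 encodes subnormals
      - every encoding is finite (no Inf/NaN)
    """
    if exp_bits < 1 or man_bits < 0:
        raise ValueError("need at least one exponent bit and man_bits >= 0")
    total_bits = 1 + exp_bits + man_bits
    if not 0 <= bits < (1 << total_bits):
        raise ValueError(f"bit pattern does not fit in {total_bits} bits")

    # Split the pattern into its three fields.
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1

    if exp == 0:
        # Subnormal: no implicit leading one, fixed minimum exponent.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading one plus the fractional mantissa.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


# The same 6-bit pattern means different values under different formats.
fp6_e3m2 = decode_fp(0b011111, exp_bits=3, man_bits=2)
fp6_e2m3 = decode_fp(0b011111, exp_bits=2, man_bits=3)
```

Under these assumptions the same 6-bit pattern decodes to different values depending on how the bits are divided between exponent and mantissa, which is why a compute unit hard-wired to one split cannot directly serve another.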

Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support

Training Deep Neural Networks with 8-bit Floating Point Numbers

PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit

Ascend HiFloat8 Format for Deep Learning

Training Deep Neural Networks Using Posit Number System.

FxpNet: Training a deep convolutional neural network in fixed-point representation

Deep Learning Training on the Edge with Low-Precision Posits

Low-Precision Floating-Point Schemes for Neural Network Training

Optimizing FPGA-Based DNN Accelerator with Shared Exponential Floating-Point Format

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

Exploring the Potential of Flexible 8-Bit Format: Design and Algorithm

Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

Improving Neural Network Efficiency Via Post-training Quantization with Adaptive Floating-Point

LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks

Representation range needs for 16-bit neural network training

FP8 versus INT8 for efficient deep learning inference

Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers.

Training DNNs with Hybrid Block Floating Point

FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices