Abstract:The rapid advancement in AI requires efficient accelerators for training on edge devices, which often face challenges related to the high hardware costs of floating-point arithmetic operations. To tackle these problems, efficient floating-point formats inspired by block floating-point (BFP), such as Microsoft Floating Point (MSFP) and FlexBlock (FB), are emerging. However, they have limited dynamic range and precision for the smaller magnitude values within a block due to the shared exponent. This limits the BFP's ability to train deep neural networks (DNNs) with diverse datasets. This paper introduces the hybrid precision (HPFP) selection algorithms, designed to systematically reduce precision and implement hybrid precision strategies, thereby balancing layer-wise arithmetic operations and data path precision to address the shortcomings of traditional floating-point formats. Reducing the data bit width with HPFP allows more read/write operations from memory per cycle, thereby decreasing off-chip data access and the size of on-chip memories. Unlike traditional reduced precision formats that use BFP for calculating partial sums and accumulating those partial sums in 32-bit Floating Point (FP32), HPFP leads to significant hardware savings by performing all multiply and accumulate operations in reduced floating-point format. For evaluation, two training accelerators for the YOLOv2-Tiny model were developed, employing distinct mixed precision strategies, and their performance was benchmarked against an accelerator utilizing a conventional brain floating point of 16 bits (Bfloat16). The HPFP selection, employing 10 bits for the data path of all layers and for the arithmetic of layers requiring low precision, along with 12 bits for layers requiring higher precision, results in a 49.4% reduction in energy consumption and a 37.5% decrease in memory access. This is achieved with only a marginal mean Average Precision (mAP) degradation of 0.8% when compared to an accelerator based on Bfloat16. This comparison demonstrates that the proposed accelerator based on HPFP can be an efficient approach to designing compact and low-power accelerators without sacrificing accuracy.

A 28nm 1.07TFLOPS/mm<sup>2</sup> Dynamic-Precision Training Processor with Online Dynamic Execution and Multi- Level-Aligned Block-FP Processing

A Dynamic Execution Neural Network Processor for Fine-Grained Mixed-Precision Model Training Based on Online Quantization Sensitivity Analysis

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Hybrid Precision Floating-Point (HPFP) Selection to Optimize Hardware-Constrained Accelerator for CNN Training

A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation

A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

PL-NPU: an Energy-Efficient Edge-Device DNN Training Processor with Posit-Based Logarithm-Domain Computing

Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors

Dynamic Precision Multiplier For Deep Neural Network Accelerators

A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration

Trainer: an Energy-Efficient Edge-Device Training Processor Supporting Dynamic Weight Pruning

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Training Deep Neural Networks with 8-bit Floating Point Numbers

Training DNNs with Hybrid Block Floating Point

Adaptive Precision Training for Resource Constrained Devices

EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

A 22nm 54.94TFLOPS/W Transformer Fine-Tuning Processor with Exponent-Stationary Re-Computing, Aggressive Linear Fitting, and Logarithmic Domain Multiplicating

FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding

A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications