Abstract: There is a high energy cost associated with training Deep Neural Networks (DNNs). Off-chip memory access contributes a major portion of the overall energy consumption. The number of off-chip memory transactions can be reduced by quantizing the data words to a low bit-width (e.g., 8-bit). However, low-bit-width data formats suffer from a limited dynamic range, resulting in reduced accuracy. In this paper, a novel 8-bit Floating Point (FP8) quantized DNN training methodology is presented, which adapts to the required dynamic range on the fly. Our methodology relies on varying the bias value of the FP8 format to fit its dynamic range to the range required by the DNN parameters and input feature maps. The range fitting during training is performed adaptively by an online statistical analysis hardware unit without stalling the computation units or their data accesses. Our approach is compatible with any DNN compute core without major modifications to the architecture. We propose to integrate the new FP8 quantization unit in the memory controller. The FP32 data from the compute core are converted to FP8 in the memory controller before being written to the DRAM and converted back after being read from the DRAM. Our results show that the DRAM access energy is reduced by 3.07× when using an 8-bit data format instead of a 32-bit one. The accuracy loss of the proposed methodology with 8-bit quantized training is approximately 1% for various networks with image and natural language processing datasets.
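To make the role of the exponent bias concrete, the following Python sketch rounds a value to an FP8-style format with an adjustable bias. It is an illustration under stated assumptions, not the paper's hardware: the 1-4-3 sign/exponent/mantissa split, the saturating overflow behavior, the absence of inf/NaN encodings, and the name `fp8_round` are all hypothetical choices that the abstract does not specify.

```python
import math

def fp8_round(x: float, bias: int, exp_bits: int = 4, man_bits: int = 3) -> float:
    """Round x to the nearest value of a small float format with the given
    exponent bias. Assumed layout: 1 sign bit, exp_bits exponent bits,
    man_bits mantissa bits, subnormals supported, saturation on overflow."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = 1 - bias                       # exponent of the smallest normal value
    e_max = (2 ** exp_bits - 1) - bias     # exponent of the largest value
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** e_max
    if mag >= max_val:                     # saturate instead of overflowing
        return sign * max_val
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    exponent = max(e - 1, e_min)           # clamp so tiny values use the subnormal grid
    step = 2.0 ** (exponent - man_bits)    # spacing between representable values
    return sign * round(mag / step) * step # round() is round-half-to-even

# Same input, two bias choices (the bias shifts the representable range):
print(fp8_round(0.3, bias=7))      # 0.3125
print(fp8_round(1000.0, bias=7))   # 480.0 (saturated)
print(fp8_round(1e-4, bias=7))     # 0.0 (underflow)
print(fp8_round(1e-4, bias=15))    # close to 1e-4
```

With a bias of 7, a value of 1e-4 underflows to zero, whereas a bias of 15 keeps it within about 1% of the original, at the cost of a smaller maximum (1.875 instead of 480). This trade-off is why the bias has to be matched to the data being stored.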

What problem does this paper attempt to address?

The problem this paper addresses is the energy consumed by off-chip memory (such as DRAM) accesses during the training of deep neural networks (DNNs). Specifically, the paper proposes a new adaptive quantization method that uses the 8-bit floating-point (FP8) data format in place of the traditional 32-bit floating-point (FP32) data format for DNN training. The method dynamically adjusts the FP8 bias value using an online statistical analysis hardware unit, so that the format fits the dynamic range required by the DNN parameters and input feature maps. This significantly reduces DRAM access energy while maintaining high accuracy.

### Main contributions of the paper:

1. **New online statistical analysis method**: A method based on online statistical analysis is proposed to reduce DRAM access energy when using the FP8 data format for DNN training. The method is general and compatible with any compute core, and is implemented by integrating the online statistical analysis unit into the DRAM memory controller.
2. **Low-area, low-power online median calculation unit**: A novel approximate online median calculation unit is designed to identify the FP8 bias value. This unit does not add extra latency to memory transactions.
3. **Extensive experimental verification**: The effectiveness of the method is verified through experiments on multiple datasets and network structures, and the design of the statistical unit and the hardware results are presented in detail.

### Method overview:

- **Initial stage**: In the first few epochs of training (denoted as 'e'), all data are stored in DRAM in FP32 format. These data are sampled by the online statistical analysis unit to calculate the appropriate bias value.
- **Bias value calculation**: The median of the DNN data is calculated and matched against the medians in a reference table to select an appropriate bias value.
- **Quantization and de-quantization**: Starting from epoch 'e + 1', DRAM write requests are quantized to the FP8 format and read requests are de-quantized back to the FP32 format in the memory controller, thereby reducing the total number of DRAM accesses and the energy consumption.

### Key technical details:

- **Median calculation**: The median is used as the statistical indicator because it is robust to outliers. The data range is divided into multiple bins, and the estimate of the median is refined step by step.
- **Hardware implementation**: A four-stage median calculation process is designed, with each stage corresponding to one epoch. Hardware overhead is minimized so that the calculation introduces no extra latency.

### Experimental results:

- **Energy consumption reduction**: Using the FP8 data format instead of the FP32 data format reduces DRAM access energy by a factor of 3.07.
- **Accuracy loss**: Across multiple networks and datasets, the accuracy loss of training with 8-bit quantization is approximately 1%.

### Conclusion:

This paper proposes an online adaptive quantization method. By dynamically adjusting the FP8 bias value, it significantly reduces DRAM access energy during DNN training while maintaining high training accuracy, which gives it a clear energy-efficiency advantage in practical applications.
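The bin-based median refinement described in the summary can be sketched in Python as follows. This is a software illustration under assumptions, not the paper's hardware unit: the number of bins (16), the uniform bin widths, the known initial range `[lo, hi)`, and the name `refine_median` are hypothetical. Each stage makes one streaming pass over the data, which loosely mirrors the summary's "one stage per epoch".

```python
def refine_median(data, lo, hi, stages=4, bins=16):
    """Approximate the median of data by repeatedly histogramming the
    current window [lo, hi) and zooming into the bin holding the median."""
    n = len(data)
    target = (n - 1) // 2            # zero-based rank of the (lower) median
    for _ in range(stages):
        width = (hi - lo) / bins
        counts = [0] * bins
        below = 0                    # samples smaller than the current window
        for x in data:               # one streaming pass; only counters are stored
            if x < lo:
                below += 1
            elif x < hi:
                counts[min(int((x - lo) / width), bins - 1)] += 1
        cum = below
        for i, c in enumerate(counts):
            if cum + c > target:     # the median rank falls inside bin i
                lo, hi = lo + i * width, lo + (i + 1) * width
                break
            cum += c
    return 0.5 * (lo + hi)           # midpoint of the final, narrow window

# Hypothetical usage: 1001 samples whose exact median is 501.
samples = [float(v) for v in range(1, 1002)]
estimate = refine_median(samples, 0.0, 1024.0)
print(estimate)
```

Only the bin counters and the current window bounds need to be kept between passes, so the storage cost is independent of the number of samples. After four stages with 16 bins the window is 1/65536 of the initial range.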

Novel adaptive quantization methodology for 8-bit floating-point DNN training

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

Training Deep Neural Networks with 8-bit Floating Point Numbers

Towards Accurate and Efficient Sub-8-Bit Integer Training

Efficient Post-training Quantization with FP8 Formats

FP8 versus INT8 for efficient deep learning inference

Towards efficient full 8-bit integer DNN online training on resource-limited devices without batch normalization

Gradient Distribution-aware INT8 Training for Neural Networks

Exploiting Retraining-Based Mixed-Precision Quantization for Low-Cost DNN Accelerator Design

Low Precision Quantization-aware Training in Spiking Neural Networks with Differentiable Quantization Function

Dataflow-Based Joint Quantization for Deep Neural Networks

ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization

BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition

Bit Efficient Quantization for Deep Neural Networks

AdaQAT: Adaptive Bit-Width Quantization-Aware Training

Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models

Dataflow-based Joint Quantization of Weights and Activations for Deep Neural Networks

Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks

A 4-Bit Integer-Only Neural Network Quantization Method Based on Shift Batch Normalization