Abstract: Convolutional neural networks (CNNs) have been widely used in many tasks, but training CNNs is time-consuming and energy-hungry. Low-bit integer formats have proven promising for speeding up CNN inference and improving its energy efficiency, but the training phase can hardly benefit from this technique because of the following challenges: (1) The integer data format cannot meet the dynamic range requirements of training, resulting in an accuracy drop; (2) The floating-point data format keeps a large dynamic range by using many more exponent bits, resulting in higher accumulation power than the integer format; (3) Some specially designed data formats (e.g., with group-wise scaling) have the potential to address the former two problems, but common hardware cannot support them efficiently. To tackle these challenges and let the training phase of CNNs benefit from low-bit formats, we propose a low-bit training framework for convolutional neural networks that pursues a better trade-off between accuracy and energy efficiency. (1) We adopt element-wise scaling to increase the dynamic range of the data representation, which greatly reduces the quantization error; (2) Group-wise scaling with a hardware-friendly factor format is designed to reduce the element-wise exponent bits without degrading the accuracy; (3) We design a customized hardware unit that implements low-bit tensor convolution arithmetic with our multi-level scaling data format. Experiments show that our framework achieves a better trade-off between accuracy and bit-width than previous low-bit training studies. For training a variety of models on CIFAR-10, using a 1-bit mantissa and a 2-bit exponent is adequate to keep the accuracy loss within 1%. On larger datasets such as ImageNet, using a 4-bit mantissa and a 2-bit exponent is adequate. Through energy consumption simulation of the computing units, we estimate that training a variety of models with our framework could achieve 8.3 ∼ 10.2× and 1.9 ∼ 2.3× higher energy efficiency than single-precision and 8-bit floating-point arithmetic, respectively.
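
The abstract does not spell out the multi-level scaling format, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a power-of-two scale shared by a group of values, a small per-element exponent stored as a non-negative offset below that scale, and an unsigned integer mantissa with a separate sign. The function name `quantize_group` and the parameters `mantissa_bits` and `exponent_bits` are hypothetical.

```python
import math

def quantize_group(values, mantissa_bits=4, exponent_bits=2):
    """Quantize one group of floats (illustrative sketch, assumed format).

    Assumptions, not taken from the paper:
      - the group-wise scale is a power of two set by the largest magnitude;
      - each element stores an exponent offset in [0, 2**exponent_bits - 1]
        below the group exponent;
      - the mantissa is rounded to `mantissa_bits` fractional bits.
    Returns the dequantized values so the error can be inspected directly.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0] * len(values)

    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) = e - 1.
    group_exp = math.frexp(max_abs)[1] - 1
    max_shift = 2 ** exponent_bits - 1
    # Largest representable magnitude at the top exponent.
    max_val = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** group_exp

    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        elem_exp = math.frexp(abs(v))[1] - 1
        # Element-wise exponent: offset below the group exponent, clamped
        # to what the few exponent bits can express.
        shift = min(max(group_exp - elem_exp, 0), max_shift)
        step = 2.0 ** (group_exp - shift - mantissa_bits)
        # Round to the nearest multiple of the step, then saturate.
        q = min(round(abs(v) / step) * step, max_val)
        out.append(math.copysign(q, v) if q else 0.0)
    return out

example = quantize_group([1.0, 0.5, 0.3, -0.01], mantissa_bits=1, exponent_bits=2)
```

In this sketch, small elements get a finer step through their own exponent offset, while elements far below the group maximum fall under the smallest step and round to zero, which is the dynamic-range limit that the number of exponent bits controls.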

Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

BOOST: Block Minifloat-Based On-Device CNN Training Accelerator with Transfer Learning

ADaPTION: Toolbox and Benchmark for Training Convolutional Neural Networks with Reduced Numerical Precision Weights and Activation

Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms

Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights

Towards Lower Bit Multiplication for Convolutional Neural Network Training

A Comparison Among Different Numeric Representations in Deep Convolution Neural Networks

SG-Float: Achieving Memory Access and Computing Power Reduction Using Self-Gating Float in CNNs

Efficient Neural Image Decoding Via Fixed-Point Inference

Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights

Exploring the Potential of Low-bit Training of Convolutional Neural Networks

Retrain-Less Weight Quantization for Multiplier-Less Convolutional Neural Networks

ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks

Training Deep Neural Networks with 8-bit Floating Point Numbers

An Efficient Kernel Transformation Architecture for Binary- and Ternary-Weight Neural Network Inference

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Deep Neural Network inference with reduced word length

Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA