Abstract: Precision scaling has emerged as a popular technique to optimize the compute and storage requirements of Deep Neural Networks (DNNs). Efforts toward creating ultra-low-precision (sub-8-bit) DNNs suggest that the minimum precision required to achieve a given network-level accuracy varies considerably across networks, and even across layers within a network, requiring support for variable precision in DNN hardware. Previous proposals such as bit-serial hardware incur high overheads, significantly diminishing the benefits of lower precision. To efficiently support precision re-configurability in DNN accelerators, we introduce an approximate computing method wherein DNN computations are performed block-wise (a block is a group of bits) and re-configurability is supported at the granularity of blocks. Results of block-wise computations are composed in an approximate manner to enable efficient re-configurability. We design a DNN accelerator that embodies approximate blocked computation and propose a method to determine a suitable approximation configuration for a given DNN. By varying the approximation configurations across DNNs, we achieve 1.17x-1.73x and 1.02x-2.04x improvement in system energy and performance respectively, over an 8-bit fixed-point (FxP8) baseline, with negligible loss in classification accuracy. Further, by varying the approximation configurations across layers and data-structures within DNNs, we achieve 1.25x-2.42x and 1.07x-2.95x improvement in system energy and performance respectively, with negligible accuracy loss.
What problem does this paper attempt to address?
The core problem this paper addresses is: **how to efficiently support variable-precision computation in deep neural network (DNN) hardware accelerators, improving energy efficiency and performance while keeping the loss in classification accuracy small**.
Specifically, the paper focuses on:
1. **The need for reduced precision**: Low-precision (sub-8-bit) computation is a popular technique for improving the energy efficiency of DNN inference. However, the minimum precision required varies considerably across networks, across layers, and even across data structures within a network, so the hardware must support variable-precision computation.
2. **Limitations of existing methods**: Existing variable-precision hardware (such as bit-serial architectures) supports variable precision, but incurs high energy and latency overheads that erode the benefits of lower precision.
3. **Proposing the Ax-BxP method**: To address these problems, the paper proposes Ax-BxP (Approximate Blocked Computation). Multiply-accumulate operations are performed block-wise, and approximation is introduced by performing only a subset of the block-level computations that an exact result would require. Precision can then be reconfigured at the granularity of blocks.
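The idea in item 3 can be illustrated with a minimal sketch. It assumes unsigned 8-bit operands split into two 4-bit blocks, and it keeps the partial products with the largest positional weight. The function names, the `keep` parameter, and the selection rule are illustrative assumptions, not the paper's actual scheme.

```python
def split_blocks(value, num_blocks, block_bits):
    """Split an unsigned integer into blocks, least significant block first."""
    mask = (1 << block_bits) - 1
    return [(value >> (i * block_bits)) & mask for i in range(num_blocks)]


def blocked_multiply(a, b, num_blocks=2, block_bits=4, keep=None):
    """Multiply a and b block-wise, summing only `keep` partial products.

    Hypothetical selection rule: keep the partial products whose
    positional weight 2^((i + j) * block_bits) is largest.
    With keep=None, all partial products are used and the result is exact.
    """
    a_blocks = split_blocks(a, num_blocks, block_bits)
    b_blocks = split_blocks(b, num_blocks, block_bits)

    # One block-level multiplication per (i, j) pair of blocks.
    partials = [
        (i + j, a_blocks[i] * b_blocks[j])
        for i in range(num_blocks)
        for j in range(num_blocks)
    ]
    # Most significant partial products first.
    partials.sort(key=lambda p: p[0], reverse=True)

    if keep is None:
        keep = len(partials)
    return sum(prod << (shift * block_bits) for shift, prod in partials[:keep])


if __name__ == "__main__":
    a, b = 0xB7, 0x5C
    for keep in range(1, 5):
        approx = blocked_multiply(a, b, keep=keep)
        print(f"keep={keep}: approx={approx}, exact={a * b}")
```

Each dropped partial product saves one block-level multiplication, at the cost of an error bounded by the weight of the omitted terms.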
### Main features of Ax-BxP:
- **Block-level computing**: Weights and activation values are divided into fixed-length blocks, each containing multiple bits.
- **Approximate computing**: Approximation is introduced by performing only some of the block-level computations, which yields an efficient variable-precision configuration.
- **Hardware design**: The paper proposes architectural enhancements to a DNN accelerator based on a standard systolic array to support Ax-BxP computation.
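As a rough illustration of how an approximation configuration trades accuracy for work, the sketch below computes a dot product in which a configuration is an explicit list of block-index pairs to multiply. This representation, the function names, and the sample data are assumptions made for illustration. The paper's configuration format and its systolic-array mapping are not reproduced here.

```python
def block(value, index, block_bits):
    """Return block `index` of an unsigned integer (index 0 is least significant)."""
    return (value >> (index * block_bits)) & ((1 << block_bits) - 1)


def approx_dot(weights, activations, pairs, block_bits=4):
    """Dot product that computes only the block pairs listed in `pairs`.

    `pairs` is a hypothetical configuration: each (i, j) entry means
    "multiply block i of the weight by block j of the activation".
    """
    acc = 0
    for w, x in zip(weights, activations):
        for i, j in pairs:
            partial = block(w, i, block_bits) * block(x, j, block_bits)
            acc += partial << ((i + j) * block_bits)
    return acc


# Made-up unsigned 8-bit sample data, two 4-bit blocks per value.
weights = [183, 20, 250, 97]
activations = [92, 201, 14, 66]
exact = sum(w * x for w, x in zip(weights, activations))

configs = {
    "full": [(1, 1), (1, 0), (0, 1), (0, 0)],
    "drop_low_low": [(1, 1), (1, 0), (0, 1)],
    "msb_only": [(1, 1)],
}

if __name__ == "__main__":
    for name, pairs in configs.items():
        approx = approx_dot(weights, activations, pairs)
        rel_err = (exact - approx) / exact
        block_mults = len(pairs) * len(weights)  # crude cost proxy
        print(f"{name}: block multiplies={block_mults}, relative error={rel_err:.4f}")
```

The number of block multiplications is only a crude proxy for cost. Real energy and latency depend on the accelerator design.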
### Experimental results:
For DNN models such as AlexNet, ResNet50, and MobileNetV2, Ax-BxP achieves improvements of 1.1x-1.74x in system energy and 1.02x-2x in performance relative to an 8-bit fixed-point (FxP8) baseline, with a very small loss in classification accuracy (less than 1% on average). Varying the approximation configuration at a finer granularity, across the layers and data structures of a DNN, yields further gains: 1.12x-2.23x in system energy and 1.14x-2.34x in performance.
### Summary:
With Ax-BxP, the paper offers an efficient way to support variable-precision computation in DNN hardware accelerators, significantly improving energy efficiency and performance while keeping the loss in classification accuracy small.