LHC: A Low-Power Heterogeneous Computing Method on Neural Network Accelerator

Fangxin Liu, Kunpeng Xie, Cheng Gong, Shusheng Liu, Ye Lu, Tao Li
DOI: https://doi.org/10.1109/icpads47876.2019.00053
2019-01-01
Abstract: Accelerators can achieve high performance and low energy consumption in neural network training and inference. If computation-heavy Non-Neural Network (Non-NN) algorithms could make full use of these accelerators, it would be possible to speed up their execution, reduce energy consumption, and achieve load balancing, especially on mobile devices equipped with accelerators. However, accelerators are dedicated to neural network calculations, so Non-NN algorithms have difficulty exploiting their advantages. Furthermore, hardware-specific restrictions, such as constrained operand precision and limited computation scale, pose additional obstacles. In this paper, we propose a method named Low-power Heterogeneous Computing (LHC) to bridge the gap between Non-NN algorithms and NN accelerators. First, we analyze the general principle of the accelerator and reveal its calculation model. To hide the details of the underlying neural network library, we encapsulate the low-level library, extract operators suitable for general algorithms from the limited set of neural network computations it supports, and implement more advanced operators that adapt to the constrained hardware conditions. These operators help programmers implement Non-NN algorithms. On the algorithm side, we extract the computationally intensive parts of a Non-NN algorithm and deploy these computational tasks on the accelerator by calling the operators. To verify our method, we adapt and implement three Non-NN algorithms with these operators, namely Grid-based Motion Statistics, k-Nearest Neighbors, and k-Means, on a specific accelerator, Cambricon-1A. The experimental results show that the energy consumption of calculation is reduced by up to 5.4x compared with the CPU baseline. Our method can be further applied to other similar accelerators.
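The abstract does not give the operator interface, so the following is only a minimal sketch of the general idea, under stated assumptions: the computationally intensive part of k-Nearest Neighbors (the pairwise dot products) is routed through a matrix-multiply operator of the kind a neural network accelerator offers, and a higher-level operator tiles the inner dimension to respect a limited computation scale. The names `matmul_op`, `tiled_matmul`, `knn_indices`, and the limit `TILE` are hypothetical, the "accelerator" is simulated in plain Python, and constrained operand precision is not modeled.

```python
# Illustrative sketch only: these names are hypothetical, not the paper's API.

TILE = 2  # assumed maximum inner dimension a single operator call accepts


def matmul_op(a, b):
    """Stand-in for an accelerator matrix-multiply operator.

    a is m x k and b is k x n; k must not exceed TILE, which mimics a
    hardware limit on computation scale.
    """
    k = len(b)
    assert k <= TILE, "operand exceeds the assumed hardware limit"
    assert all(len(row) == k for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(n)]
            for row in a]


def tiled_matmul(a, b):
    """Higher-level operator: split the inner dimension into tiles of at most
    TILE elements and accumulate the partial products."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for start in range(0, k, TILE):
        a_tile = [row[start:start + TILE] for row in a]
        b_tile = b[start:start + TILE]
        part = matmul_op(a_tile, b_tile)
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j]
    return out


def knn_indices(queries, points, k):
    """Return, for each query, the indices of its k nearest points.

    Uses ||q - p||^2 = ||q||^2 - 2 q.p + ||p||^2, so the heavy part (all
    q.p dot products) is a single matrix product sent through tiled_matmul.
    """
    points_t = [list(col) for col in zip(*points)]  # transpose to d x n
    dots = tiled_matmul(queries, points_t)
    p_norms = [sum(x * x for x in p) for p in points]
    result = []
    for qi, q in enumerate(queries):
        q_norm = sum(x * x for x in q)
        dists = [q_norm - 2.0 * dots[qi][pj] + p_norms[pj]
                 for pj in range(len(points))]
        order = sorted(range(len(points)), key=lambda j: dists[j])
        result.append(order[:k])
    return result
```

For example, `knn_indices([[0.9, 0.0, 0.0]], [[0, 0, 0], [1, 0, 0], [5, 5, 5], [0, 1, 1]], 2)` ranks the second point first and the origin second. On real hardware, `matmul_op` would be replaced by a call into the vendor library, and the sort and norm computations would presumably stay on the CPU.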