Abstract: In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computationally intensive and resource-consuming, and are thus hard to integrate into embedded systems such as smartphones, smart glasses, and robots. FPGAs are one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNNs. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric and fully connected layers are memory-centric. We then propose a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNNs, to improve bandwidth and resource utilization. Results show that our data quantization flow introduces only 0.4% accuracy loss for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and of the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.
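The abstract names dynamic-precision data quantization without describing it. As a rough illustration only, the sketch below assumes the idea is a fixed-point format whose fractional length is chosen separately for each layer, so that layers with small values keep more fractional bits. The function names, the candidate range, and the sample weights are all hypothetical and are not taken from the paper.

```python
def quantize(x, bit_width, frac_len):
    """Round x to signed fixed point with `bit_width` total bits and
    `frac_len` fractional bits, saturating at the representable range."""
    scale = 2.0 ** frac_len
    q_min = -(1 << (bit_width - 1))
    q_max = (1 << (bit_width - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale


def best_frac_len(values, bit_width, candidates=range(-8, 17)):
    """Pick the fractional length with the smallest total absolute error.
    The candidate range is an assumption made for this sketch."""
    def total_error(frac_len):
        return sum(abs(v - quantize(v, bit_width, frac_len)) for v in values)
    return min(candidates, key=total_error)


# Hypothetical per-layer weights with very different dynamic ranges.
layers = {
    "small_range": [0.011, -0.024, 0.007, -0.015],
    "large_range": [3.2, -7.9, 5.5, -1.1],
}

# One fractional length per layer: static inside a layer, varying across layers.
chosen = {name: best_frac_len(w, bit_width=8) for name, w in layers.items()}

for name, frac_len in chosen.items():
    restored = [quantize(v, 8, frac_len) for v in layers[name]]
    print(name, "frac_len =", frac_len, "quantized =", restored)
```

With a single shared fractional length, one of the two layers would either saturate or lose most of its resolution. Choosing the length per layer is what lets a short bit width cover both ranges in this toy setting.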

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Accelerating convolutional neural networks on FPGAs

A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm

Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs

A Scalable FPGA Accelerator for Convolutional Neural Networks

A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks

Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA

Design of FPGA-Based Accelerator for Convolutional Neural Network under Heterogeneous Computing Framework with OpenCL

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs

Optimizing Convolutional Neural Network Accelerator on Low-Cost FPGA

Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA

Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks

Accelerating CNN inference on FPGAs: A Survey

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network