Abstract: Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, a variant of 2-D CNNs, have shown an excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead on the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them focus only on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using a zero-point, static BFP quantization can decrease the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization can reduce the precision to an 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization. Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8-5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4-2.2 times higher resource efficiency on both 2-D and 3-D CNNs.
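
The abstract does not spell out the static BFP scheme, so the following is only a minimal sketch of generic block floating point with an 8-bit mantissa, not the authors' implementation. It assumes that "static" means the shared exponent of a block is fixed offline (here from a hypothetical calibration block) and reused at inference time, rather than recomputed from each new block as in dynamic BFP. All function names are illustrative.

```python
import math


def shared_exponent(block, mantissa_bits=8):
    """Choose one exponent for a whole block (assumed to run offline)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(max_abs)
    # Shift so the largest magnitude lands within the signed mantissa range.
    return e - (mantissa_bits - 1)


def quantize_block(block, exponent, mantissa_bits=8):
    """Map floats to signed integer mantissas that share `exponent`."""
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    scale = 2.0 ** exponent
    # Values outside the calibrated range are clamped (saturated).
    return [min(hi, max(lo, round(x / scale))) for x in block]


def dequantize_block(mantissas, exponent):
    """Recover approximate float values from mantissas and the exponent."""
    return [m * 2.0 ** exponent for m in mantissas]


def bfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product using integer multiply-accumulates and one final scaling."""
    acc = sum(a * b for a, b in zip(mant_a, mant_b))
    return acc * 2.0 ** (exp_a + exp_b)


# Hypothetical calibration data fixes the exponents once, ahead of time.
weights = [0.5, -0.25, 0.125, 0.0]
w_exp = shared_exponent(weights)
w_mant = quantize_block(weights, w_exp)

act_exp = shared_exponent([0.5, 0.25, -0.5, 0.125])
activations = [0.25, 0.125, -0.5, 0.0625]
a_mant = quantize_block(activations, act_exp)  # no per-block exponent search

result = bfp_dot(w_mant, w_exp, a_mant, act_exp)
```

In this sketch the inner loop of `bfp_dot` operates only on integers, and the exponents enter once at the end. Because the activation exponent is fixed in advance, quantizing a new block requires no search for its maximum, which is one plausible reading of how a static scheme avoids the representation conversions of dynamic BFP.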

Optimization for Efficient Hardware Implementation of CNN on FPGA

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A Solution to Optimize Multi-Operand Adders in CNN Architecture on FPGA

Power Efficient Tiny Yolo CNN Using Reduced Hardware Resources Based on Booth Multiplier and WALLACE Tree Adders

Full-stack Optimization for Accelerating CNNs with FPGA Validation

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Optimization of Convolution Neural Network Algorithm Based on FPGA

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs

Designing Deep Learning Hardware Accelerator and Efficiency Evaluation

Acceleration of Deep Neural Network Training Using Field Programmable Gate Arrays

FPGA-based Accelerator for Convolutional Neural Network

Optimization of the Convolution Operation to Accelerate Deep Neural Networks in FPGA

Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs

Optimizing Neural Networks for Efficient FPGA Implementation: A Survey

WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor

High-Performance Acceleration of 2-D and 3-D CNNs on FPGAs Using Static Block Floating Point

Survey of convolutional neural network accelerators on field-programmable gate array platforms: architectures and optimization techniques

Accelerating CNN inference on FPGAs: A Survey

Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration

Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster

A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA