Abstract: Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown an excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs substantially reduces the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them focus only on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using a zero-point, static BFP quantization can decrease the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to reduce the precision to an 8-bit mantissa with negligible accuracy loss. Because different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization.
Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8-5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4-2.2 times higher resource efficiency on both 2-D and 3-D CNNs.
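
To make the idea of block floating point concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "static" means the shared exponent of each block is fixed ahead of time (here, from a hypothetical calibration block) instead of being recomputed at run time, and that mantissas are signed 8-bit integers. All function names are illustrative.

```python
import math


def shared_exponent(block, mantissa_bits=8):
    """Smallest power-of-two exponent e such that max|v| / 2**e fits in a signed mantissa."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0
    max_mant = 2 ** (mantissa_bits - 1) - 1  # 127 for 8 bits
    return math.ceil(math.log2(max_abs / max_mant))


def quantize_block(block, exponent, mantissa_bits=8):
    """Map each value to an integer mantissa under one shared exponent, saturating on overflow."""
    max_mant = 2 ** (mantissa_bits - 1) - 1
    scale = 2.0 ** exponent
    # Python's built-in round() rounds to nearest, with ties going to the even integer.
    return [max(-max_mant, min(max_mant, round(v / scale))) for v in block]


def dequantize_block(mantissas, exponent):
    """Recover approximate real values from mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


def bfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product: integer multiply-accumulate, then a single exponent adjustment."""
    acc = sum(a * b for a, b in zip(mant_a, mant_b))
    return acc * 2.0 ** (exp_a + exp_b)


# Assumed offline step: exponents are chosen once and then kept fixed.
weights = [0.5, -0.25, 0.125, 1.0]
calibration_activations = [2.0, -1.5, 0.75, 3.0]  # hypothetical calibration data
w_exp = shared_exponent(weights)
a_exp = shared_exponent(calibration_activations)
w_mant = quantize_block(weights, w_exp)

# Run time: new activations reuse the fixed exponent, so no exponent search is needed.
activations = [1.0, -3.0, 0.5, 2.0]
a_mant = quantize_block(activations, a_exp)
result = bfp_dot(w_mant, w_exp, a_mant, a_exp)
```

In this sketch the inner loop uses only integer multiplications and additions, and the exponents are applied once per block. Values that exceed the range implied by the fixed exponent are saturated, which is one possible design choice and an assumption of this example.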

Hardware Acceleration of Convolutional Neural Network Based on 3D-Cube Structure

Efficient Binary 3D Convolutional Neural Network and Hardware Accelerator

Accelerating 3D Convolutional Neural Networks Using 3D Fast Fourier Transform

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

F-C3D: FPGA-based 3-Dimensional Convolutional Neural Network

An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs

High-Performance Acceleration of 2-D and 3-D CNNs on FPGAs Using Static Block Floating Point

High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization

Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA

Efficient Hardware Architectures for Deep Convolutional Neural Network

Design of FPGA Accelerator Architecture for Convolutional Neural Network

A FPGA-based Hardware Accelerator for Multiple Convolutional Neural Networks

FPGA-based Accelerator for Convolutional Neural Network

A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs

Optimized Compression for Implementing Convolutional Neural Networks on FPGA

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

Flexible and Efficient Convolutional Acceleration on Unified Hardware Using the Two-Stage Splitting Method and Layer-Adaptive Allocation of 1-D/2-D Winograd Units

FPGA Accelerator for CNN: an Exploration of the Kernel Structured Sparsity and Hybrid Arithmetic Computation

A High Energy Efficiency and Low Resource Consumption FPGA Accelerator for Convolutional Neural Network

FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio