Abstract: Today, convolutional and deconvolutional neural network models are exceptionally popular thanks to the impressive accuracy they have demonstrated in several computer-vision applications. To speed up the overall tasks of these neural networks, purpose-designed accelerators are highly desirable. Unfortunately, the high computational complexity and the huge memory demand still make the design of efficient hardware architectures, as well as their deployment in resource- and power-constrained embedded systems, quite challenging. This paper presents a novel purpose-designed hardware accelerator to perform 2D deconvolutions. The proposed structure applies a hardware-oriented computational approach that overcomes the issues of traditional deconvolution methods, and it is suitable for implementation within virtually any system-on-chip based on field-programmable gate array (FPGA) devices. In fact, the novel accelerator is easily scaled to comply with the resources available within both high- and low-end devices by adjusting the adopted parallelism. As an example, when exploited to accelerate the Deep Convolutional Generative Adversarial Network (DCGAN) model, the novel accelerator, running as a standalone unit implemented within the Xilinx Zynq XC7Z020 System-on-Chip (SoC) device, performs up to 72 GOPs. Moreover, it dissipates less than 500 mW at 200 MHz and occupies 5.6%, 4.1%, 17%, and 96%, respectively, of the look-up tables, flip-flops, random access memory, and digital signal processors available on-chip. When accommodated within the same device, the whole embedded system equipped with the novel accelerator performs up to 54 GOPs and dissipates less than 1.8 W at 150 MHz. Thanks to the increased parallelism exploitable, more than 900 GOPs can be executed when the high-end Virtex-7 XC7VX690T device is used as the implementation platform.
Furthermore, in comparison with state-of-the-art competitors implemented within the Zynq XC7Z045 device, the system proposed here reaches a computational capability up to 20% higher and reduces power consumption and logic resource requirements by more than 60% and 80%, respectively, while using 5.7× fewer on-chip memory resources.
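
For readers unfamiliar with the operation being accelerated, the following is a minimal Python sketch of a plain 2D deconvolution (transposed convolution) in its textbook scatter form. It is not the hardware-oriented approach proposed in the paper, which the abstract does not detail. The function names `deconv2d` and `count_ops`, the default stride, the absence of padding, and the convention of counting one multiply-accumulate as two operations are all assumptions made for illustration.

```python
def deconv2d(x, k, stride=2):
    """Reference 2D deconvolution (transposed convolution), no padding.

    x: input feature map as a list of rows (H x W)
    k: kernel as a list of rows (KH x KW)
    Returns a map of size ((H-1)*stride + KH) x ((W-1)*stride + KW).
    """
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    y = [[0.0] * out_w for _ in range(out_h)]
    # Each input pixel scatters a scaled copy of the kernel into the
    # output; copies overlap and accumulate whenever stride < kernel size.
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    y[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    return y


def count_ops(h, w, ksize):
    """Operations for one H x W input map and one ksize x ksize kernel.

    Assumption: one multiply-accumulate counts as two operations.
    """
    return 2 * h * w * ksize * ksize


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    k = [[1, 1, 1] for _ in range(3)]
    for row in deconv2d(x, k, stride=2):
        print(row)
    print("operations:", count_ops(2, 2, 3))
```

The overlapping accumulations in the innermost loop are one reason this direct formulation maps awkwardly onto hardware: several input pixels may update the same output location.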
