Abstract: Today, convolutional and deconvolutional neural network models are exceptionally popular thanks to the impressive accuracy they have demonstrated in several computer-vision applications. To speed up the overall tasks of these neural networks, purpose-designed accelerators are highly desirable. Unfortunately, the high computational complexity and the huge memory demand still make it quite challenging to design efficient hardware architectures and to deploy them in resource- and power-constrained embedded systems. This paper presents a novel purpose-designed hardware accelerator to perform 2D deconvolutions. The proposed structure applies a hardware-oriented computational approach that overcomes the issues of traditional deconvolution methods, and it is suitable for implementation within virtually any system-on-chip based on field-programmable gate array devices. In fact, the novel accelerator is easily scalable: by adjusting the adopted parallelism, it can fit the resources available within both high- and low-end devices. As an example, when exploited to accelerate the Deep Convolutional Generative Adversarial Network model, the novel accelerator, running as a standalone unit implemented within the Xilinx Zynq XC7Z020 System-on-Chip (SoC) device, performs up to 72 GOPs. Moreover, it dissipates less than 500 mW at 200 MHz and occupies 5.6%, 4.1%, 17%, and 96%, respectively, of the look-up tables, flip-flops, random access memory, and digital signal processors available on-chip. When accommodated within the same device, the whole embedded system equipped with the novel accelerator performs up to 54 GOPs and dissipates less than 1.8 W at 150 MHz. Thanks to the higher exploitable parallelism, more than 900 GOPs can be executed when the high-end Virtex-7 XC7VX690T device is used as the implementation platform.
Moreover, in comparison with state-of-the-art competitors implemented within the Zynq XC7Z045 device, the system proposed here reaches a computational capability up to 20% higher and reduces power consumption and logic resource requirements by more than 60% and 80%, respectively, while using 5.7× fewer on-chip memory resources.
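
For readers unfamiliar with the operation being accelerated, the following is a minimal software reference for a 2D deconvolution (transposed convolution) written in the direct "scatter" form. It is only an illustrative sketch: the function names `deconv2d` and `op_count`, the single-channel layout, the square kernel, and the absence of padding are all assumptions made here, and the code does not reproduce the hardware-oriented approach proposed in the paper. The operation count assumes that each multiply-accumulate is counted as two operations, which is a common convention but is not stated in the abstract.

```python
def deconv2d(x, w, stride=2):
    """Reference 2D deconvolution (transposed convolution), no padding.

    x: single-channel input feature map as a list of rows (H x W)
    w: square kernel as a list of rows (K x K)
    Returns an output map of size ((H-1)*stride + K) x ((W-1)*stride + K).
    """
    h, wd = len(x), len(x[0])
    k = len(w)
    out_h = (h - 1) * stride + k
    out_w = (wd - 1) * stride + k
    y = [[0.0] * out_w for _ in range(out_h)]
    # Each input pixel scatters a scaled copy of the kernel into the output.
    # When stride < K, neighbouring copies overlap and must be accumulated.
    for i in range(h):
        for j in range(wd):
            for a in range(k):
                for b in range(k):
                    y[i * stride + a][j * stride + b] += x[i][j] * w[a][b]
    return y


def op_count(h, wd, k):
    """Operations for one input/output channel pair (assumed: 1 MAC = 2 ops)."""
    return 2 * h * wd * k * k
```

With a 2×2 input, a 2×2 kernel of ones, and a stride of 2, the scattered kernel copies do not overlap; with a stride of 1 they do, so the central output element accumulates contributions from all four input pixels.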
