Abstract: FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators use a single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNNs through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while keeping the on-chip buffer size small. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology, based on a performance model, is proposed to achieve better performance and scalability. The proposed accelerator is evaluated on the Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator reduces on-chip memory size by up to 68.3% and also reduces off-chip memory access compared to the reference design. It achieves up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.

A High-speed Low-cost CNN Inference Accelerator for Depthwise Separable Convolution

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator

Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA

A High-performance Inference Accelerator Exploiting Patterned Sparsity in CNNs

Hardware Implementation of Depthwise Separable Convolution Neural Network

A Hardware Accelerator for Standard Convolution and Depthwise Convolution

A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks

A 16.41 TOPS/W CNN Accelerator with Event-Based Layer Fusion for Real-Time Inference

A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs

A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks

FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

FPGA-based Accelerator for Convolutional Neural Network

An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes

High-Performance FPGA-Based CNN Accelerator with Block-Floating-Point Arithmetic

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

A Power-Efficient and High Performance FPGA Accelerator for Convolutional Neural Networks: Work-in-progress

An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs

A FPGA-based end-to-end acceleration framework for fast deployment of Convolutional Neural Networks

WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor