Abstract: With the rapid development and continuous evolution of convolutional neural networks (CNNs), FPGAs have become one of the most attractive platforms for deploying CNNs because of their reprogrammability, low power consumption, and fast time to market. However, as network structures deepen, previous FPGA solutions based on traditional convolution remain limited by computational power, making it challenging to meet feedforward performance requirements. In this article, we introduce the state-of-the-art octave convolution (OctConv) into CNN accelerator design for the first time to improve hardware acceleration efficiency. We design a dedicated OctPU that maps OctConv onto FPGAs efficiently and employs a parallel dataflow pattern to fully exploit the parallelism of OctConv. Based on this, we present a novel and scalable architecture that dynamically combines an inter-layer pipelined structure with a multi-layer reuse structure, achieving a trade-off between specificity and scalability under limited resources. To find an optimized solution in the complex design space, we build a multidimensional performance and resource analysis model and a two-stage search algorithm based on greedy and heuristic algorithms. We evaluate our proposal by implementing VGG16 and ResNet50 on the Xilinx VU9P FPGA. Experimental results show that, using OctConv, our prototype accelerators achieve an average of 3321 GOP/s on the convolutional layers of VGG16 and 2873 GOP/s on ResNet50 overall. Compared to previous works based on traditional convolution, our prototypes achieve a 1.72× to 2.33× speedup in throughput and a 2.01× to 5.18× improvement in computational density. Our design also offers an excellent trade-off between performance and generalization compared to previous hardware and software co-optimization works.
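To make the motivation for OctConv concrete, the following minimal sketch estimates how many multiply-accumulate operations an octave convolution needs relative to a traditional convolution. It relies on assumptions that are not stated in the abstract: the channel split ratio `alpha` (the fraction of channels in the low-frequency group, taken to be the same for input and output) and the half-resolution storage of low-frequency feature maps follow the general OctConv formulation, and the function name `octconv_cost_ratio` is purely illustrative. It is not the performance model used in this work.

```python
def octconv_cost_ratio(alpha: float) -> float:
    """Multiply-accumulate count of an octave convolution relative to a
    traditional convolution with the same channel counts and kernel size.

    Assumption: `alpha` is the fraction of channels in the low-frequency
    group, equal for input and output. Low-frequency maps are stored at
    half resolution in each spatial dimension, so any path computed at
    low resolution costs 1/4 as much per channel pair.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    high_to_high = (1 - alpha) ** 2          # computed at full resolution
    high_to_low = alpha * (1 - alpha) / 4    # pool first, convolve at low resolution
    low_to_high = alpha * (1 - alpha) / 4    # convolve at low resolution, then upsample
    low_to_low = alpha ** 2 / 4              # low resolution throughout
    return high_to_high + high_to_low + low_to_high + low_to_low


for alpha in (0.0, 0.25, 0.5, 0.75):
    print(f"alpha={alpha:.2f}  relative MACs={octconv_cost_ratio(alpha):.4f}")
```

Under these assumptions, `alpha = 0.5` gives a ratio of 0.4375, that is, fewer than half the multiply-accumulates of the traditional convolution. The estimate counts arithmetic only; it ignores the pooling and upsampling steps and any memory-traffic effects, which an accelerator design would also have to account for.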

A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs

WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs

Accelerating convolutional neural networks on FPGAs

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

A Scalable FPGA Accelerator for Convolutional Neural Networks

A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks

WinoNN: Optimizing FPGA-Based Convolutional Neural Network Accelerators Using Sparse Winograd Algorithm

A High-Performance Accelerator for Large-Scale Convolutional Neural Networks

Optimizing Convolutional Neural Network Accelerator on Low-Cost FPGA

OctCNN: A High Throughput FPGA Accelerator for CNNs using Octave Convolution Algorithm

A Power-Efficient and High Performance FPGA Accelerator for Convolutional Neural Networks: Work-in-progress

BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Efficient CNN Accelerator on FPGA

Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs

Throughput-Optimized OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neural Networks

SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs

An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes