Abstract: Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection problems. The success of CNNs is, to a great extent, enabled by the fast scaling-up of networks that learn from a huge volume of data. Deploying big CNN models can be both computation-intensive and memory-intensive, posing severe challenges for hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while retaining similar accuracy have emerged as promising solutions to alleviate the computation overheads associated with CNNs [1]. However, imposing sparsity on CNNs usually produces random network connections, and the resulting irregular data access pattern leads to poor data locality. The low computation efficiency of sparse networks, caused by the resulting imbalance in computing resource consumption and low memory bandwidth utilization, significantly offsets the theoretical reduction in computation complexity and limits the execution scalability of CNNs on general-purpose architectures [2]. For instance, sparse convolution, an important computation kernel in CNNs, is usually accelerated with data compression schemes in which only the nonzero elements of the kernel weights are stored and sent to multiplication-accumulation computations (MACs) at runtime. However, the corresponding executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance, even when dedicated software libraries are applied (e.g., the MKL library for CPUs and the cuSPARSE library for GPUs). Field programmable gate arrays (FPGAs) have also been extensively studied as an important hardware platform for CNN computations [3]. Unlike general-purpose architectures, FPGAs allow users to customize the functions and organization of the designed hardware in order to adapt to various resource needs and data usage patterns.
This characteristic, as we identify in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordination between software and hardware. In particular, the reconfigurability of FPGAs helps to 1) better map the sparse CNN onto the hardware to improve computation parallelism and execution efficiency and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impacts of irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN acceleration. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during the training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. Second, we develop a distributed architecture composed of customized processing elements (PEs) that enables high computation parallelism and a high data reuse rate for the compressed network. Moreover, a holistic sparse optimization is introduced into our design framework to serve hardware platforms with different requirements. We evaluate our proposed framework by executing AlexNet on Xilinx Zynq ZC706. Our FPGA accelerator obtains a processing power of 71.2 GOPS, corresponding to 271.6 GOPS on the dense CNN model. On average, our FPGA design runs 11.5× faster than a well-tuned CPU implementation on Intel Xeon E5-2630, and achieves 3.2× better energy efficiency than the GPU implementation on Nvidia Pascal Titan X. Compared to state-of-the-art FPGA designs [4], our accelerator reduces the classification time by 2.1×, with
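The compressed sparse convolution mentioned above, in which only nonzero kernel weights are stored and sent to MACs, can be illustrated with a minimal Python sketch. This is not the proposed accelerator or its data format: the coordinate-list layout and the names `compress_kernel`, `sparse_conv2d`, and `dense_conv2d` are assumptions chosen purely for illustration, and the input is a toy single-channel example.

```python
# Illustrative sketch only: the storage layout and all function names are
# assumptions for this example, not the format used by the proposed design.

def compress_kernel(kernel):
    """Keep only the nonzero weights as (row, col, value) triples."""
    return [(r, c, w)
            for r, row in enumerate(kernel)
            for c, w in enumerate(row)
            if w != 0]

def sparse_conv2d(image, nonzeros, k_h, k_w):
    """'Valid' 2-D convolution (CNN-style cross-correlation) that issues
    one MAC per stored nonzero weight per output position."""
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for r, c, w in nonzeros:
                acc += w * image[y + r][x + c]
                macs += 1
            out[y][x] = acc
    return out, macs

def dense_conv2d(image, kernel):
    """Reference dense version: one MAC per weight, zero or not."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for r in range(k_h):
                for c in range(k_w):
                    acc += kernel[r][c] * image[y + r][x + c]
                    macs += 1
            out[y][x] = acc
    return out, macs

# Toy input: a 4x4 image and a 3x3 kernel with 2 of 9 weights nonzero.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, 0],
          [0, 0, 0],
          [0, 0, -1]]

nonzeros = compress_kernel(kernel)
sparse_out, sparse_macs = sparse_conv2d(image, nonzeros, 3, 3)
dense_out, dense_macs = dense_conv2d(image, kernel)
```

Both versions produce the same output, but the sparse loop issues MACs only for the stored nonzeros. Its input reads follow the scattered `(row, col)` offsets of those nonzeros, which is a small-scale picture of the irregular data access pattern described above.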
