Abstract: Convolutional neural networks (CNNs) have proven effective in many application domains, especially in computer vision. To achieve lower-latency CNN processing and reduce power consumption, developers are experimenting with FPGAs to accelerate CNN processing in several applications. Current FPGA CNN accelerators usually use the same acceleration approach as GPUs, where operations from different network layers are mapped to the same hardware units working in a multiplexed manner. This gives high flexibility in implementing different types of CNNs, but it degrades the latency that accelerators can achieve. Alternatively, the latency of the accelerator can be reduced by pipelining the processing of consecutive layers, at the expense of more FPGA resources. The continued increase in hardware resources available in FPGAs makes such implementations feasible for latency-critical application domains. In this paper, we present FPQNet, a fully pipelined and quantized CNN FPGA implementation that is channel-parallel, layer-pipelined, and network-parallel to decrease latency and increase throughput, combined with quantization methods to optimize hardware utilization. In addition, we optimize this hardware architecture for the HDMI timing standard to avoid extra hardware utilization, which makes it possible for the accelerator to handle video datasets. We present prototypes of the FPQNet CNN implementations on an Alpha Data 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate the architecture's capabilities. Results show that, at a 250 MHz clock frequency, an optimized LeNet-5 design achieves latencies as low as 9.32 µs with an accuracy of 98.8% on the MNIST dataset, making it feasible for high frame rate video processing applications. With 10 hardware kernels working concurrently, the throughput is as high as 1108 GOPs. The methods in this paper are suitable for many other CNNs: our analysis shows that, using the architecture introduced in this paper, the latency of AlexNet, ZFNet, OverFeat-Fast, and OverFeat-Accurate can be as low as 69.27, 66.95, 182.98, and 132.6 µs, respectively.
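
As a rough sanity check on the figures quoted in the abstract, the short Python sketch below converts the reported LeNet-5 latency into clock cycles at the stated 250 MHz clock. It also derives two back-of-the-envelope quantities that are assumptions of this sketch, not claims from the paper: the frame rate if frames were processed strictly one at a time (a pipelined design could exceed it), and the per-kernel throughput if the 1108 GOPs were split evenly across the 10 kernels. All function and constant names are illustrative.

```python
# Figures quoted in the abstract.
CLOCK_HZ = 250e6             # 250 MHz clock frequency
LENET5_LATENCY_S = 9.32e-6   # 9.32 microseconds end-to-end latency
TOTAL_GOPS = 1108.0          # throughput with all kernels running
NUM_KERNELS = 10             # hardware kernels working concurrently


def latency_to_cycles(latency_s: float, clock_hz: float) -> int:
    """Number of clock cycles that fit in the given latency."""
    return round(latency_s * clock_hz)


def sequential_frame_rate(latency_s: float) -> float:
    """Frames per second if each frame had to finish before the next starts.

    Assumption of this sketch: no overlap between frames.
    """
    return 1.0 / latency_s


def per_kernel_gops(total_gops: float, num_kernels: int) -> float:
    """Throughput per kernel, assuming an even split (not stated in the paper)."""
    return total_gops / num_kernels


if __name__ == "__main__":
    cycles = latency_to_cycles(LENET5_LATENCY_S, CLOCK_HZ)
    fps = sequential_frame_rate(LENET5_LATENCY_S)
    gops = per_kernel_gops(TOTAL_GOPS, NUM_KERNELS)
    print(f"LeNet-5 latency: {cycles} cycles at 250 MHz")
    print(f"Sequential frame rate bound: {fps:,.0f} frames/s")
    print(f"Per-kernel throughput (even split): {gops:.1f} GOPs")
```

The cycle count comes to 2330, which gives a sense of how few clock cycles a fully pipelined LeNet-5 pass occupies at this clock rate.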

HFOD: A Hardware-friendly Quantization Method for Object Detection on Embedded FPGAs

REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs

High Power-Efficient and Performance-Density FPGA Accelerator for CNN-Based Object Detection

Apply Yolov4-Tiny on an FPGA-Based Accelerator of Convolutional Neural Network for Object Detection

An FPGA-Based YOLOv5 Accelerator for Real-Time Industrial Vision Applications

A dedicated hardware accelerator for real-time acceleration of YOLOv2

FPGA-Based Hybrid-Type Implementation of Quantized Neural Networks for Remote Sensing Applications

Efficient Hardware Post Processing of Anchor-Based Object Detection on FPGA

End-to-end Acceleration of the YOLO Object Detection Framework on FPGA-only Devices

A hardware-friendly logarithmic quantization method for CNNs and FPGA implementation

FPQNet: Fully Pipelined and Quantized CNN for Ultra-Low Latency Image Classification on FPGAs Using OpenCAPI

FPGA Implementation for CNN-Based Optical Remote Sensing Object Detection

FPGA Implementation of a Deep Learning Acceleration Core Architecture for Image Target Detection

FPGA-based Object Detection Acceleration Architecture Design

Design Implementation of FPGA-Based Neural Network Acceleration

FPGA Implementation of Quantized Convolutional Neural Networks

An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network

Quantization-Based Optimization Algorithm for Hardware Implementation of Convolution Neural Networks

Custom Network Quantization Method for Lightweight CNN Acceleration on FPGAs

Automated flow for compressing convolution neural networks for efficient edge-computation with FPGA

A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms