Abstract: Object detection is an important computer vision task with a wide range of applications, including autonomous driving and smart security. However, its high computational requirements pose challenges for deployment on resource-limited edge devices, so dedicated hardware accelerators are desired to improve detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. The non-maximum suppression (NMS) algorithm is the core of post-processing; it eliminates redundant boxes that belong to the same object. However, NMS becomes a bottleneck for hardware acceleration because it requires multiple iterations and must wait until all predicted boxes have been generated. In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. The proposed algorithm alleviates the NMS performance bottleneck by mapping the iterative algorithm onto an efficient pipelined hardware circuit. We validate the algorithm on the VOC2007 dataset and show that it introduces only a 0.27% difference compared with the baseline NMS. In addition, the exponential and sigmoid functions are extremely costly in hardware. To address this issue, we propose an approximate exponential function circuit that computes both functions with minimal logic cost and zero DSP cost. We deploy our post-processing accelerator on Xilinx’s Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 μs for the YOLOv2 model. Following the user guides provided by Xilinx and Intel, we converted the logic resources of the different FPGA implementations into equivalent LUT resources, and then compared the resource utilization of our design with that of the acceleration module in the current state-of-the-art object detection system deployed on an Intel FPGA. Compared with that system, our design consumes 13.5× fewer LUT resources and far fewer DSP resources.
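
For orientation, the conventional greedy NMS that the abstract calls the baseline can be sketched as follows. This is a minimal software illustration, not the hardware-friendly variant proposed in the paper. The function names, the corner-based box layout `(x1, y1, x2, y2)`, and the 0.5 IoU threshold are assumptions chosen for the example. The sketch shows both properties the abstract identifies as bottlenecks: all boxes must be available before sorting can start, and the suppression loop iterates a data-dependent number of times.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box],
               scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS."""
    # Sorting needs every predicted box, so nothing can start early.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    # Each pass keeps the best remaining box and drops its heavy overlaps;
    # the number of passes depends on the data.
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    demo_scores = [0.9, 0.8, 0.7]
    print(greedy_nms(demo_boxes, demo_scores))
```

In the demo, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap either and is kept.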

A Method for Accelerating YOLO by Hybrid Computing Based on ARM and FPGA

YOLO Acceleration Using FPGA Architecture

An FPGA-Based Reconfigurable CNN Accelerator for YOLO

Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2

A dedicated hardware accelerator for real-time acceleration of YOLOv2

End-to-end Acceleration of the YOLO Object Detection Framework on FPGA-only Devices

Design and Implementation of YOLOv3-Tiny Accelerator Based on PYNQ-Z2 Heterogeneous Platform

A CNN Hardware Accelerator Designed for YOLO Algorithm Based on RISC-V SoC

An FPGA-Based YOLOv5 Accelerator for Real-Time Industrial Vision Applications

Hardware Acceleration for Object Detection using YOLOv5 Deep Learning Algorithm on Xilinx Zynq FPGA Platform

Hardware Acceleration and Implementation of YOLOX-s for On-Orbit FPGA

Design of a YOLO Model Accelerator Based on PYNQ Architecture

Enhancing FPGA-Based YOLO Object Detection: Multi-Bank Storage Optimization and Model Refinement for Real-Time Applications

Design Implementation of FPGA-Based Neural Network Acceleration

Reduced-Parameter YOLO-Like Object Detector Oriented to Resource-Constrained Platform

A Scalable OpenCL-Based FPGA Accelerator for YOLOv2

A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms

Universal accelerator software and hardware collaborative design for YOLO algorithm

LPYOLO: Low Precision YOLO for Face Detection on FPGA

Compression of YOLOX object detection network and deployment on FPGA

Lightweight Convolutional Neural Network of YOLO V3-Tiny Algorithm on FPGA for Target Detection