Abstract: Object detection is an important computer vision task with a wide range of applications, including autonomous driving and smart security. However, its high computational requirements pose challenges for deployment on resource-limited edge devices, so dedicated hardware accelerators are desired to deliver better detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. The non-maximum suppression (NMS) algorithm is the core of post-processing; it eliminates redundant boxes belonging to the same object. However, NMS becomes a bottleneck for hardware acceleration because it requires multiple iterations and must wait for all predicted boxes to be generated. In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. Our proposed algorithm alleviates the performance bottleneck of NMS by implementing the iterative algorithm as an efficient pipelined hardware circuit. We validate our algorithm on the VOC2007 dataset and show that it introduces only a 0.27% difference compared to the baseline NMS. Additionally, the exponential and sigmoid functions are extremely costly in hardware. To address this issue, we propose an approximate exponential function circuit that calculates both functions with minimal logic cost and zero DSP cost. We deploy our post-processing accelerator on Xilinx’s Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 μs for the YOLOv2 model. To compare across vendors, we convert the logic resources of different FPGA implementations into equivalent LUT resources according to the user guides provided by Xilinx and Intel. We then compare the resource utilization of our design with that of the acceleration module in the current state-of-the-art object detection system deployed on an Intel device. Compared with that system, our design consumes 13.5× fewer LUT resources and far fewer DSP resources.
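
For context, the baseline NMS referred to in the abstract is commonly implemented as a greedy loop. The following is a minimal software sketch of that conventional baseline, not the hardware-friendly algorithm proposed in the paper. The corner box format `(x1, y1, x2, y2)`, the function names `iou` and `greedy_nms`, and the default threshold of 0.5 are assumptions chosen for illustration. The sketch shows the two properties named as bottlenecks: the sort needs every predicted box before it can start, and each kept box triggers another pass over the remaining candidates.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed default, not taken from the paper
) -> List[int]:
    """Return the indices of the kept boxes, highest score first."""
    # Sorting requires all scores, so every prediction must already exist.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # One suppression pass per kept box: the iterative part of NMS.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

In this example the second box overlaps the first with an IoU of 81/119 (about 0.68), which exceeds the assumed threshold, so it is suppressed and the kept indices are `[0, 2]`.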

Design of a YOLO Model Accelerator Based on PYNQ Architecture

FPGA-based Object Detection Acceleration Architecture Design

Design and Implementation of YOLOv3-Tiny Accelerator Based on PYNQ-Z2 Heterogeneous Platform

Universal accelerator software and hardware collaborative design for YOLO algorithm

An FPGA-Based YOLOv5 Accelerator for Real-Time Industrial Vision Applications

End-to-end Acceleration of the YOLO Object Detection Framework on FPGA-only Devices

Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2

A High-Performance YOLOV5 Accelerator for Object Detection with Near Sensor Intelligence

A Method for Accelerating YOLO by Hybrid Computing Based on ARM and FPGA

A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms

Reduced-Parameter YOLO-Like Object Detector Oriented to Resource-Constrained Platform

A dedicated hardware accelerator for real-time acceleration of YOLOv2

YOLO Acceleration Using FPGA Architecture

Enhancing FPGA-Based YOLO Object Detection: Multi-Bank Storage Optimization and Model Refinement for Real-Time Applications

System Integration and Optimization of AI Hardware Acceleration Architecture for Object Detection

Design Implementation of FPGA-Based Neural Network Acceleration

Hardware Acceleration for Object Detection using YOLOv5 Deep Learning Algorithm on Xilinx Zynq FPGA Platform

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices

Design and implementation of FPGA-based deep learning object detection system

High Power-Efficient and Performance-Density FPGA Accelerator for CNN-Based Object Detection

Compression of YOLOX object detection network and deployment on FPGA