Abstract: Object detection is an important computer vision task with a wide range of applications, including autonomous driving and smart security. However, its high computational requirements pose challenges for deployment on resource-limited edge devices, so dedicated hardware accelerators are desired to improve detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. The non-maximum suppression (NMS) algorithm, which eliminates redundant boxes belonging to the same object, is the core of post-processing. However, NMS becomes a bottleneck for hardware acceleration because it requires multiple iterations and must wait until all predicted boxes have been generated. In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. Our algorithm alleviates the performance bottleneck of NMS by mapping the iterative algorithm onto an efficient pipelined hardware circuit. We validate our algorithm on the VOC2007 dataset and show that it introduces a difference of only 0.27% compared with the baseline NMS. Additionally, the exponential and sigmoid functions are extremely costly in hardware. To address this issue, we propose an approximate exponential function circuit that computes both functions with minimal logic cost and zero DSP cost. We deploy our post-processing accelerator on Xilinx’s Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 μs for the YOLOv2 model. Following the user guides provided by Xilinx and Intel, we convert the logic resources of the different FPGA implementations into equivalent LUT resources, and then compare the resource utilization of our design with that of the acceleration module in the current state-of-the-art object detection system deployed on an Intel FPGA. Compared with that system, our design consumes 13.5× fewer LUT resources and far fewer DSP resources.
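
For context, the baseline that the abstract refers to can be illustrated with a minimal sketch of conventional greedy NMS. This is the standard software formulation, not the hardware-friendly algorithm proposed in the paper. The names `iou`, `greedy_nms`, and `iou_threshold`, the corner-format boxes, and the 0.5 default threshold are assumptions chosen for illustration. The sketch shows the two properties the abstract identifies as obstacles to pipelining: the sort needs every predicted box up front, and each kept box triggers another pass over the remaining candidates.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS."""
    # Sorting requires that all predicted boxes already exist.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        # Each iteration keeps the highest-scoring remaining box ...
        best = order.pop(0)
        keep.append(best)
        # ... and rescans the rest to drop boxes that overlap it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap either and is kept.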
