Abstract: Object detection is an important computer vision task with a wide range of applications, including autonomous driving and smart security. However, its high computational requirements pose challenges for deployment on resource-limited edge devices. Dedicated hardware accelerators are therefore desired to deliver better detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. The non-maximum suppression (NMS) algorithm, which eliminates redundant boxes belonging to the same object, is the core of post-processing. However, NMS becomes a bottleneck for hardware acceleration because it requires multiple iterations and must wait for all predicted boxes to be generated. In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. Our proposed algorithm alleviates the performance bottleneck of NMS by implementing the iterative algorithm as an efficient pipelined hardware circuit. We validate our algorithm on the VOC2007 dataset and show that it introduces only a 0.27% difference compared to the baseline NMS. Additionally, the exponential and sigmoid functions are extremely costly in hardware. To address this issue, we propose an approximate exponential function circuit that computes both functions with minimal logic cost and zero DSP cost. We deploy our post-processing accelerator on Xilinx’s Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 μs for the YOLOv2 model. Following the user guides provided by Xilinx and Intel, we convert the logic resources of different FPGA implementations into equivalent LUT resources, and then compare the resource utilization of our design with that of the acceleration module in the current state-of-the-art object detection system deployed on an Intel FPGA. Compared with that system, our design consumes 13.5× fewer LUT resources and far fewer DSP resources.
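
For readers unfamiliar with the baseline, the following is a minimal Python sketch of conventional greedy NMS, not the hardware-friendly variant proposed in the paper. The function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are illustrative assumptions. It shows the two properties the abstract identifies as obstacles to pipelining: the sort needs every predicted box up front, and the suppression loop runs a data-dependent number of iterations.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS."""
    # Sorting by score requires that all predicted boxes already exist.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    # Each pass keeps the best remaining box and removes its overlaps,
    # so the number of passes depends on the data.
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)` and `(20, 20, 30, 30)` scored 0.9, 0.8 and 0.7, the second box overlaps the first with an IoU of 81/119 and is suppressed, while the first and third are kept.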

A Low-Latency FPGA Implementation for Real-Time Object Detection

A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms

End-to-end Acceleration of the YOLO Object Detection Framework on FPGA-only Devices

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

A dedicated hardware accelerator for real-time acceleration of YOLOv2

Apply Yolov4-Tiny on an FPGA-Based Accelerator of Convolutional Neural Network for Object Detection

Design and implementation of FPGA-based deep learning object detection system

High Power-Efficient and Performance-Density FPGA Accelerator for CNN-Based Object Detection

Enhancing FPGA-Based YOLO Object Detection: Multi-Bank Storage Optimization and Model Refinement for Real-Time Applications

FPGA-Based Vehicle Detection and Tracking Accelerator

Towards High-accuracy and Real-time Two-stage Small Object Detection on FPGA

An Efficient Real-Time Object Detection Framework on Resource-Constricted Hardware Devices via Software and Hardware Co-design

Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2

An FPGA-Based YOLOv5 Accelerator for Real-Time Industrial Vision Applications

REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs

Reduced-Parameter YOLO-Like Object Detector Oriented to Resource-Constrained Platform

Lifting Based Object Detection Networks of Remote Sensing Imagery for FPGA Accelerator

WGeod: A General and Efficient FPGA Accelerator for Object Detection

Yolov3-tiny Object Detection SoC Based on FPGA Platform

FPGA-based Object Detection Acceleration Architecture Design

FPGA Implementation of a Deep Learning Acceleration Core Architecture for Image Target Detection