Abstract: Deploying deep learning models on embedded systems for computer vision tasks has been challenging due to limited compute resources and strict energy budgets. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects and therefore require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all spatial locations in an image, while dynamic deformable convolution may access arbitrary pixels in the image, with an access pattern that is input-dependent and varies with spatial location. These properties lead to inefficient memory accesses of inputs on existing hardware. In this work, we harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a set of algorithm modifications, including irregular-access versus limited-range and fixed-shape, on a flexible hardware accelerator. We evaluate these algorithmic changes with corresponding hardware optimizations and show a 1.36x and 9.76x speedup for the full and depthwise deformable convolution, respectively, on hardware with minor accuracy loss. We then co-design a network called CoDeNet with the modified deformable convolution for object detection and quantize the network to 4-bit weights and 8-bit activations. With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object detection dataset, Pascal VOC. With our higher-accuracy implementation, our model reaches 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters, making it 20.9x smaller but 10% more accurate than Tiny-YOLO.
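To make the contrast between a fixed sampling grid and an input-dependent one concrete, the following is a minimal pure-Python sketch of a single-channel 3x3 deformable convolution. It is an illustration under stated assumptions, not the CoDeNet implementation: the function names, the bilinear interpolation of fractional positions, the zero padding, and the `max_offset` clamp (one possible reading of the "limited-range" modification) are all hypothetical choices made for this example. In a real network the offsets would be predicted from the input by another layer; here they are passed in directly.

```python
import math


def bilinear(img, y, x):
    """Sample img at a fractional (y, x); positions outside the image read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Weight falls off linearly with distance to the sample point.
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * img[yy][xx]
    return val


def deform_conv3x3(img, weights, offsets, max_offset=None):
    """Single-channel 3x3 deformable convolution, stride 1, zero padding.

    weights: 9 kernel taps in row-major order.
    offsets[y][x]: 9 (dy, dx) pairs, one per tap, for output location (y, x).
    max_offset: if given, clamp every offset to [-max_offset, max_offset]
                (assumed interpretation of a limited-range variant).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k in range(9):
                ky, kx = k // 3 - 1, k % 3 - 1  # regular 3x3 grid position
                dy, dx = offsets[y][x][k]       # per-location, per-tap offset
                if max_offset is not None:
                    dy = max(-max_offset, min(max_offset, dy))
                    dx = max(-max_offset, min(max_offset, dx))
                acc += weights[k] * bilinear(img, y + ky + dy, x + kx + dx)
            out[y][x] = acc
    return out


def constant_offsets(h, w, dy, dx):
    """Helper for the demo: the same (dy, dx) for every location and tap."""
    return [[[(dy, dx)] * 9 for _ in range(w)] for _ in range(h)]


img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
center_only = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # kernel that passes the center tap

regular = deform_conv3x3(img, center_only, constant_offsets(3, 3, 0.0, 0.0))
shifted = deform_conv3x3(img, center_only, constant_offsets(3, 3, 0.0, 1.0))
clamped = deform_conv3x3(img, center_only, constant_offsets(3, 3, 0.0, 1.0),
                         max_offset=0)
```

With all offsets at zero the operation reduces to a regular convolution, so `regular` reproduces the input. With a uniform offset of one pixel to the right, every output reads its right-hand neighbor, and clamping the range to zero restores the regular behavior. The per-tap offset lookup inside the inner loop is what makes the memory access pattern data-dependent, which is the property the abstract identifies as costly on existing hardware.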

Design and implementation of FPGA-based deep learning object detection system

Hardware Acceleration for Object Detection using YOLOv5 Deep Learning Algorithm on Xilinx Zynq FPGA Platform

FPGA-based Object Detection Acceleration Architecture Design

System Integration and Optimization of AI Hardware Acceleration Architecture for Object Detection

Reduced-Parameter YOLO-Like Object Detector Oriented to Resource-Constrained Platform

WGeod: A General and Efficient FPGA Accelerator for Object Detection

Edge Real-Time Object Detection and DPU-Based Hardware Implementation for Optical Remote Sensing Images

A dedicated hardware accelerator for real-time acceleration of YOLOv2

An Efficient Real-Time Object Detection Framework on Resource-Constricted Hardware Devices via Software and Hardware Co-design

Resource- and Power-Efficient High-Performance Object Detection Inference Acceleration Using FPGA

CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs

An FPGA Accelerator Design of Spiking Neural Network for Energy-Efficient Object Detection

Towards High-accuracy and Real-time Two-stage Small Object Detection on FPGA

FPGA-Based Vehicle Detection and Tracking Accelerator

An FPGA-Based YOLOv5 Accelerator for Real-Time Industrial Vision Applications

FPGA-SoC implementation of YOLOv4 for flying-object detection

A High-Performance YOLOV5 Accelerator for Object Detection with Near Sensor Intelligence

Object Detection Edge Performance Optimization on FPGA-Based Heterogeneous Multiprocessor Systems

FPGA-Based Real-Time Object Detection and Classification System Using YOLO for Edge Computing

FPGA Implementation of Feature Detection Algorithm Based on High Level Synthesis

Design of embedded real-time human detection system based on Zynq platform