Abstract: Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. Such systems require real-time, energy-efficient implementations of DNNs on power-budgeted platforms. Two research thrusts are dedicated to enhancing the performance and energy efficiency of the DNN inference phase: model compression techniques and efficient hardware implementations. Recent research on extremely-low-bit CNNs, such as binary neural networks (BNNs) and XNOR-Net, replaces traditional floating-point operations with binary bit operations, significantly reducing memory bandwidth and storage requirements, but suffers from non-negligible accuracy loss and wastes digital signal processing (DSP) blocks on FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource-aware, systematic weight quantization framework for object detection that considers both algorithm and hardware resource aspects. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using the Alternating Direction Method of Multipliers (ADMM), an effective optimization technique for general, non-convex optimization problems. To achieve real-time, highly efficient implementations on FPGA, we present the detailed hardware implementation of block-circulant matrices on CONV layers and develop an efficient processing element (PE) structure supporting the heterogeneous weight quantization, CONV dataflow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resources. Experimental results show that our proposed REQ-YOLO framework can significantly compress the YOLO model while introducing very small accuracy degradation. The related code is available at https://github.com/Anonymous788/heterogeneous_ADMM_YOLO.
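
To make the block-circulant idea mentioned in the abstract concrete, the following is a minimal sketch, not the REQ-YOLO implementation. It assumes each b-by-b block is a circulant matrix defined by its first column, so a block stores b values instead of b*b. The function names (`circulant_matvec`, `block_circulant_matvec`, `expand_dense`) and the toy data are hypothetical and chosen only for illustration. Hardware designs commonly evaluate the circular convolution with FFTs; this sketch uses the direct form so that it stays dependency-free.

```python
def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by the vector x.

    Entry (i, j) of the matrix is c[(i - j) % b], so the product is the
    circular convolution of c and x.
    """
    b = len(c)
    return [sum(c[(i - j) % b] * x[j] for j in range(b)) for i in range(b)]


def block_circulant_matvec(blocks, x, b):
    """Multiply a block-circulant matrix by x.

    blocks[p][q] is the length-b defining vector (first column) of block (p, q).
    """
    y = []
    for row in blocks:
        acc = [0] * b
        for q, c in enumerate(row):
            part = circulant_matvec(c, x[q * b:(q + 1) * b])
            acc = [a + v for a, v in zip(acc, part)]
        y.extend(acc)
    return y


def expand_dense(blocks, b):
    """Expand the compact representation into a dense matrix (for checking only)."""
    dense = []
    for row in blocks:
        for i in range(b):
            dense.append([c[(i - j) % b] for c in row for j in range(b)])
    return dense


# Toy example (assumed data): a 4x4 matrix stored as 2x2 blocks of size b = 2.
blocks = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]
x = [1, 0, 2, 1]

y_compact = block_circulant_matvec(blocks, x, 2)
dense = expand_dense(blocks, 2)
y_dense = [sum(w * v for w, v in zip(row, x)) for row in dense]
print(y_compact)  # [11, 13, 27, 29]
```

In this toy case the compact form stores 8 weights while the equivalent dense matrix holds 16; the saving grows linearly with the block size b.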

Hardware Implementation and Optimization of Tiny-YOLO Network

A CNN Hardware Accelerator Designed for YOLO Algorithm Based on RISC-V SoC

Power Efficient Tiny Yolo CNN Using Reduced Hardware Resources Based on Booth Multiplier and WALLACE Tree Adders

Design and Implementation of YOLOv3-Tiny Accelerator Based on PYNQ-Z2 Heterogeneous Platform

An FPGA-Based Reconfigurable CNN Accelerator for YOLO

Differential Image-Based Scalable YOLOv7-Tiny Implementation for Clustered Embedded Systems

Exploring Hardware Friendly Bottleneck Architecture in CNN for Embedded Computing Systems

A dedicated hardware accelerator for real-time acceleration of YOLOv2

REQ-YOLO

Efficient Hardware Architectures for Deep Convolutional Neural Network

YOLO Acceleration Using FPGA Architecture

Algorithm–Hardware Co-Optimization and Deployment Method for Field-Programmable Gate-Array-Based Convolutional Neural Network Remote Sensing Image Processing

Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2

Rotating Kernel CNN Optimization for Efficient IoT Surveillance on Low-Power Devices

Enhancing FPGA-Based YOLO Object Detection: Multi-Bank Storage Optimization and Model Refinement for Real-Time Applications

FPGA Hardware Acceleration Design for Deep Learning

Lightweight Convolutional Neural Network of YOLO V3-Tiny Algorithm on FPGA for Target Detection

A Method for Accelerating YOLO by Hybrid Computing Based on ARM and FPGA

A Lightweight YOLOv5 Optimization of Coordinate Attention

A Low-Power Hardware Architecture for Real-Time CNN Computing