Abstract: This paper introduces a 16‐bit fixed‐point field‐programmable gate array (FPGA)‐based hardware accelerator for deep learning on a 32‐bit low‐memory edge device (PYNQ‐Z2 board). Singular value decomposition (SVD) reduces the weight parameters of the fully connected layer. The accelerator unit spans all five layers and uses eight processing elements for parallel computation in convolution layers 1 and 2. Array partitioning, loop unrolling, and pipelining increase computation speed. The accelerator's inference speed exceeds that of software‐based implementations on an Intel 3‐core CPU, a Haswell 2‐core CPU, and an NVIDIA Tesla K80 GPU by 89.03%, 86.12%, and 82.45%, respectively.

Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. Their convolutional layers account for most of the computation and should be computed quickly on a local edge device. Field‐programmable gate arrays (FPGAs) have been widely explored as promising hardware accelerators for CNNs because of their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA‐based 16‐bit fixed‐point hardware accelerator unit for deep learning applications on a 32‐bit low‐memory edge device (PYNQ‐Z2 board). Singular value decomposition is applied to the fully connected layer to reduce the dimensionality of its weight parameters. The accelerator unit is designed for all five layers and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to increase the speed of calculations. The AXI‐Lite interface is used for communication between the IP and the other blocks. The design is tested with grayscale image classification on the MNIST handwritten digit dataset and color image classification on the Tumor dataset.
The experimental results show that the proposed accelerator unit runs faster than the software‐based implementations. Its inference speed is 89.03% higher than that of an Intel 3‐core CPU, 86.12% higher than that of a Haswell 2‐core CPU, and 82.45% higher than that of an NVIDIA Tesla K80 GPU. Furthermore, the throughput of the proposed design is 4.33 GOP/s, which is better than that of conventional CNN accelerator architectures.
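As a rough illustration of two ideas named in the abstract, the sketch below shows how a real value might be mapped to a saturating signed 16‐bit fixed‐point integer, and how a rank‐k SVD factorization changes the weight count of a fully connected layer. It is not the authors' implementation. The Q8.8 split (8 fractional bits), the function names, and the layer sizes used in the usage note are all assumptions chosen for illustration; the abstract does not state them.

```python
# Illustrative sketch only. The Q8.8 format and all names below are
# assumptions, not values or identifiers taken from the paper.

FRAC_BITS = 8                        # assumed number of fractional bits (Q8.8)
SCALE = 1 << FRAC_BITS               # 2**8 = 256
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(q: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, q))


def to_fixed(x: float) -> int:
    """Quantize a real value to a 16-bit fixed-point integer."""
    return saturate(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The full-width product carries 2 * FRAC_BITS fractional bits, so it is
    shifted right by FRAC_BITS before saturating back to 16 bits.
    """
    return saturate((a * b) >> FRAC_BITS)


def fc_params(m: int, n: int) -> int:
    """Weight count of a dense m x n fully connected layer."""
    return m * n


def svd_fc_params(m: int, n: int, k: int) -> int:
    """Weight count after replacing the m x n matrix with rank-k factors
    of shapes m x k and k x n (singular values folded into one factor)."""
    return k * (m + n)
```

For a hypothetical 120 x 84 layer truncated to rank 10, `fc_params(120, 84)` gives 10080 weights while `svd_fc_params(120, 84, 10)` gives 2040. The factorization only saves storage when k is smaller than m·n / (m + n), and the rank that preserves accuracy has to be found experimentally.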

Efficient Deployment of Single Shot Multibox Detector Network on FPGAs

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Algorithm-Hardware Co-Design of Single Shot Detector for Fast Object Detection on FPGAs

High Power-Efficient and Performance-Density FPGA Accelerator for CNN-Based Object Detection

FPGA-based Object Detection Acceleration Architecture Design

Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

FPGA Implementation of Feature Detection Algorithm Based on High Level Synthesis

Reduced-Parameter YOLO-Like Object Detector Oriented to Resource-Constrained Platform

Towards High-accuracy and Real-time Two-stage Small Object Detection on FPGA

Design and implementation of FPGA-based deep learning object detection system

Efficient Hardware Post Processing of Anchor-Based Object Detection on FPGA

Algorithm-Hardware Co-Optimization for Energy-Efficient Drone Detection on Resource-Constrained FPGA

An FPGA Accelerator Design of Spiking Neural Network for Energy-Efficient Object Detection

A Reconfigurable Neural Network Processor With Tile-Grained Multicore Pipeline for Object Detection on FPGA

A Lightweight Detection Method for Remote Sensing Images and Its Energy-Efficient Accelerator on Edge Devices

Hardware acceleration of infrared small target detection based on FPGA

Optimized Acceleration of Single Shot Detection for Edge Computing Based-on FPGA

Network Structure Optimization and High-Efficiency Implementation of Skynet Based on FPGA

Empowering edge devices: FPGA‐based 16‐bit fixed‐point accelerator with SVD for CNN on 32‐bit memory‐limited systems

CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs

Object Detection Edge Performance Optimization on FPGA-Based Heterogeneous Multiprocessor Systems