Abstract: Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. Their convolutional layers account for most of the computation and should run fast on a local edge device. Field‐programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs because of their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA‐based 16‐bit fixed‐point hardware accelerator unit for deep learning applications on a 32‐bit low‐memory edge device (PYNQ‐Z2 board). Singular value decomposition (SVD) is applied to the fully connected layer to reduce the dimensionality of its weight parameters. The accelerator unit covers all five layers of the network and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to speed up the calculations, and an AXI‐Lite interface connects the accelerator IP to the other blocks. The design is tested with grayscale image classification on the MNIST handwritten digit dataset and color image classification on the Tumor dataset.
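The 16‐bit fixed‐point arithmetic mentioned above can be illustrated with a minimal software sketch. It assumes a Q8.8 split (8 integer bits, 8 fractional bits), which the abstract does not specify, and the names `FRAC_BITS`, `to_fixed`, `to_float`, and `fixed_dot` are hypothetical. The code models the arithmetic only; it is not the hardware design.

```python
# Assumption: Q8.8 format. The abstract only states "16-bit fixed-point".
FRAC_BITS = 8
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1  # signed 16-bit range


def saturate(value: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(Q_MIN, min(Q_MAX, value))


def to_fixed(x: float) -> int:
    """Quantize a real value to a saturated 16-bit fixed-point integer."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << FRAC_BITS)


def fixed_dot(weights, activations):
    """Multiply-accumulate in a wide accumulator, then rescale to 16 bits."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a      # each product carries 2 * FRAC_BITS fractional bits
    acc >>= FRAC_BITS     # drop back to FRAC_BITS fractional bits (floor shift)
    return saturate(acc)


if __name__ == "__main__":
    w = [to_fixed(1.5), to_fixed(-0.25)]
    a = [to_fixed(2.0), to_fixed(4.0)]
    print(to_float(fixed_dot(w, a)))
```

Keeping the accumulator wider than 16 bits and rescaling once at the end avoids losing precision on every product, which is the usual reason fixed‐point datapaths separate the multiply width from the accumulate width.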
The experimental results show that the proposed accelerator performs inference faster than software‐based implementations: its inference speed is 89.03% higher than that of an Intel 3‐core CPU, 86.12% higher than that of a Haswell 2‐core CPU, and 82.45% higher than that of an NVIDIA Tesla K80 GPU. Furthermore, the proposed design reaches a throughput of 4.33 GOP/s, which is better than conventional CNN accelerator architectures.
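The SVD‐based reduction of the fully connected layer named in the abstract can be sketched as follows. A truncated SVD replaces an m×n weight matrix with two factors of shapes m×k and k×n, so the layer stores k(m+n) values instead of mn. The sketch assumes the two factors are already available (it does not compute the SVD), and the layer sizes and the helper names `dense_params`, `low_rank_params`, and `low_rank_forward` are hypothetical, not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dense_params(rows: int, cols: int) -> int:
    """Weights stored by an unfactorized fully connected layer."""
    return rows * cols


def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Weights stored after a rank-k factorization W ~ left @ right."""
    return rank * (rows + cols)


def low_rank_forward(left, right, x):
    """Evaluate left @ (right @ x) without ever forming the full matrix."""
    return matvec(left, matvec(right, x))


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only to show the arithmetic.
    print(dense_params(256, 64), low_rank_params(256, 64, 16))

    left = [[1, 0], [2, 1], [0, 3]]      # 3 x 2 factor
    right = [[1, 2, 0, 1], [0, 1, 1, 0]]  # 2 x 4 factor
    print(low_rank_forward(left, right, [1, 2, 3, 4]))
```

The factorization only saves memory when k is smaller than mn/(m+n); for the tiny 3×4 example above the factors are actually larger than the dense matrix, and it is included only to show the forward pass.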

Empowering edge devices: FPGA‐based 16‐bit fixed‐point accelerator with SVD for CNN on 32‐bit memory‐limited systems
