Abstract: The growing concerns regarding energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating risks related to sharing sensitive data. However, deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for the deployment of CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open-source software at each optimization step of the deployment process, the customizations we added to it, and their impact on the final system's performance. We were able to achieve real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we show how this kind of solution can be integrated into a wider system by testing our proposed platform in a traffic monitoring scenario.


### What problems does this paper attempt to solve?

This paper addresses the complexity and time cost of deploying Convolutional Neural Networks (CNNs) on edge devices, especially on non-off-the-shelf hardware such as Field-Programmable Gate Arrays (FPGAs). Specifically, the authors propose an end-to-end workflow for efficiently deploying CNNs on FPGAs using the Gemmini accelerator, which they modified for efficient FPGA implementation. The main objectives of this research are:

1. **Simplify the deployment process**: Existing commercial edge devices (such as the NVIDIA Jetson series) provide software frameworks for optimizing model deployment, but for custom hardware accelerators (such as those on FPGAs), deploying pre-trained CNNs is still a manual and time-consuming task. This paper proposes an FPGA-based end-to-end workflow that simplifies this process.
2. **Improve energy efficiency and performance**: By optimizing the implementation of the Gemmini accelerator on the FPGA, the authors achieve higher energy efficiency and real-time performance. For example, when they deploy the YOLOv7 model on the Xilinx ZCU102 FPGA, they achieve an energy efficiency of 36.5 GOP/s/W.
3. **Optimize the utilization of hardware resources**: The authors reduce the consumption of FPGA resources through a series of optimizations (such as DSP packing and disabling unnecessary modules), making the design more compact and efficient.
4. **Support the complete CNN deployment**: Unlike other FPGA accelerator works that focus only on feature extraction or convolutional layers, the solution proposed in this paper covers the entire CNN, including the post-processing part (such as the NMS algorithm), and explores how to allocate tasks in a heterogeneous SoC architecture to maximize performance.
5. **Integrate into a larger system**: Finally, the authors show how to integrate this solution into a broader application scenario, such as a traffic monitoring system, thereby verifying its practical value.

### Key contributions of the paper

- Proposed an end-to-end workflow for efficiently deploying CNNs on FPGAs.
- Optimized the implementation of the Gemmini accelerator on the FPGA, reaching real-time YOLOv7 inference at 36.5 GOP/s/W on a Xilinx ZCU102.
- Demonstrated how to allocate different parts of the CNN in a heterogeneous SoC architecture to achieve the best performance.
- Verified the feasibility of this solution in a practical application scenario (traffic monitoring).

Through these efforts, this research provides a feasible solution for efficiently deploying deep-learning models on edge devices, especially in application scenarios with strict requirements for energy efficiency and privacy.
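
The summary above names non-maximum suppression (NMS) as the post-processing step that completes the detection pipeline. As a rough illustration of what that step does, here is a minimal sketch of greedy NMS in plain Python. It is not the authors' implementation: the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the default threshold of 0.5 are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept
    (higher-scoring) box is at or below the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

Because the loop is sequential and data-dependent, NMS is a natural candidate for running on a general-purpose core rather than on a matrix accelerator, which is one reason task allocation across a heterogeneous SoC matters for end-to-end performance.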

Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator
