Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator

Federico Nicolas Peccia, Svetlana Pavlitska, Tobias Fleck, Oliver Bringmann
2024-08-14
Abstract: Growing concerns about energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating the risks of sharing sensitive data. However, deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for deploying CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open-source software at each optimization step of the deployment process, the customizations we added to it, and their impact on the final system's performance. We achieved real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we show how this kind of solution can be integrated into a wider system by testing our proposed platform in a traffic monitoring scenario.
Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
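The energy-efficiency figure in the abstract is expressed in GOP/s/W, that is, throughput in giga-operations per second divided by power in watts. The sketch below shows that arithmetic only. The function name `gops_per_watt` and all input values are hypothetical placeholders for illustration; they are not measurements from the paper, and the paper's exact measurement method is not described in this summary.

```python
def gops_per_watt(ops_per_inference: float, latency_s: float, power_w: float) -> float:
    """Return energy efficiency in GOP/s/W.

    ops_per_inference: operations needed for one forward pass
    latency_s:         time for one inference, in seconds
    power_w:           average power drawn during inference, in watts
    """
    if latency_s <= 0 or power_w <= 0:
        raise ValueError("latency_s and power_w must be positive")
    throughput_gops = ops_per_inference / latency_s / 1e9  # GOP/s
    return throughput_gops / power_w                        # GOP/s/W


# Hypothetical placeholder numbers, NOT results from the paper.
example = gops_per_watt(ops_per_inference=100e9, latency_s=0.5, power_w=8.0)
print(f"{example:.1f} GOP/s/W")  # prints "25.0 GOP/s/W"
```

Because the metric divides by power, two platforms with the same throughput can differ widely in GOP/s/W, which is why it is a common basis for comparing FPGAs with other embedded devices.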
### What problems does this paper attempt to solve?

This paper addresses the complexity and manual effort involved in deploying Convolutional Neural Networks (CNNs) on edge devices, especially non-off-the-shelf hardware such as Field-Programmable Gate Arrays (FPGAs). Specifically, the authors propose an end-to-end workflow for deploying CNNs on FPGAs using the Gemmini accelerator, which they modified for efficient FPGA implementation. The main objectives of this research are:

1. **Simplify the deployment process**: Existing commercial edge devices (such as the NVIDIA Jetson series) provide software frameworks for optimizing model deployment, but for custom hardware accelerators (such as those built on FPGAs), deploying pre-trained CNNs is still a manual and time-consuming task. This paper proposes an FPGA-based end-to-end workflow that simplifies this process.
2. **Improve energy efficiency and performance**: By optimizing the implementation of the Gemmini accelerator on the FPGA, the authors achieve higher energy efficiency and real-time performance. For example, deploying the YOLOv7 model on the Xilinx ZCU102 FPGA yields an energy efficiency of 36.5 GOP/s/W.
3. **Optimize the utilization of hardware resources**: The authors reduce FPGA resource consumption through a series of optimizations (such as DSP packing and disabling unnecessary modules), making the design more compact and efficient.
4. **Support complete CNN deployment**: Unlike other FPGA accelerator works that focus only on feature extraction or convolutional layers, the proposed solution covers the entire CNN, including post-processing (such as the NMS algorithm), and explores how to allocate tasks in a heterogeneous SoC architecture to maximize performance.
5. **Integrate into a larger system**: Finally, the authors show how to integrate this solution into a broader application scenario, such as a traffic monitoring system, thereby demonstrating its practical value.

### Key contributions of the paper

- Proposed an end-to-end workflow for efficiently deploying CNNs on FPGAs.
- Optimized the implementation of the Gemmini accelerator on the FPGA, improving energy efficiency and performance.
- Demonstrated how to allocate different parts of the CNN in a heterogeneous SoC architecture to achieve the best performance.
- Verified the feasibility of this solution in a practical application scenario (traffic monitoring).

Through these efforts, this research provides a feasible solution for efficiently deploying deep-learning models on edge devices, especially in application scenarios with strict requirements for energy efficiency and privacy.
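The summary mentions NMS (non-maximum suppression) as the post-processing step that the workflow also covers. For readers unfamiliar with it, here is a minimal pure-Python sketch of the generic greedy, class-agnostic algorithm. It is not the authors' implementation; the names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # prints [0, 2]
```

The loop is data-dependent and branch-heavy rather than a dense matrix operation, which is one reason post-processing of this kind is a candidate for the CPU side of a heterogeneous SoC while the convolutional layers run on the accelerator.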