Abstract: Deploying Deep Learning (DL) on embedded end devices is a fast-growing trend in pervasive computing. Since most microcontrollers on embedded devices have limited computing power, it is necessary to add a DL accelerator. Embedded Field Programmable Gate Arrays (FPGAs) are suitable for deploying DL accelerators for embedded devices, but developing an energy-efficient DL accelerator on an FPGA is not easy. Therefore, we propose the ElasticAI-Workflow, which aims to help DL developers create and deploy DL models as hardware accelerators on embedded FPGAs. This workflow consists of two key components: the ElasticAI-Creator and the Elastic Node. The former is a toolchain for automatically generating DL accelerators on FPGAs. The latter is a hardware platform for verifying the performance of the generated accelerators. With this combination, the performance of the accelerator can be sufficiently guaranteed. We will demonstrate the potential of our approach through a case study.

What problem does this paper attempt to address?

The problem this paper addresses is the shortage of computing power and energy efficiency when deploying deep learning (DL) models on embedded devices. Specifically, because most embedded microcontrollers (MCUs) have limited computing power, running complex deep learning models directly on these devices is very difficult, especially when real-time requirements must be met. To solve this problem, the authors propose the ElasticAI-Workflow, which aims to help deep learning developers create and deploy deep learning accelerators that run efficiently on embedded FPGAs.

### Main problems:

1. **Insufficient computing power**: Microcontrollers on embedded devices usually have limited computing power and cannot execute deep learning models efficiently.
2. **Energy-efficiency requirements**: To prolong battery life and reduce energy consumption, the deep learning accelerator must be highly energy-efficient.
3. **High development difficulty**: Designing a deep learning accelerator for embedded FPGAs requires in-depth FPGA engineering knowledge, which most deep learning developers do not have.
4. **Inaccurate power consumption measurement**: Existing hardware platforms can only provide overall power consumption measurements, not fine-grained power consumption data, which makes it difficult to guide optimization work.

### Main components of ElasticAI-Workflow:

1. **ElasticAI-Creator**: A toolchain that helps developers automatically convert deep learning models written in PyTorch into RTL representations suitable for FPGAs. This greatly reduces the FPGA expertise required.
2. **Elastic Node**: A custom hardware platform used to verify the energy efficiency of the generated deep learning accelerator in a real environment. It provides fine-grained power consumption measurements, which guide optimization work.
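To make the role of fine-grained power measurement concrete, the following is a minimal sketch of how a trace of power samples could be turned into an energy-per-inference figure. It is not the Elastic Node's actual interface: the function names, the fixed sampling period, and the trapezoidal integration are all assumptions made for illustration.

```python
from typing import Sequence


def energy_joules(power_watts: Sequence[float], sample_period_s: float) -> float:
    """Approximate energy as the trapezoidal integral of evenly sampled power.

    Hypothetical helper: assumes samples are taken every `sample_period_s`
    seconds and that power is expressed in watts.
    """
    if sample_period_s <= 0:
        raise ValueError("sample_period_s must be positive")
    if len(power_watts) < 2:
        return 0.0  # a single sample spans no time interval
    total = 0.0
    for p0, p1 in zip(power_watts, power_watts[1:]):
        # Area of one trapezoid between two consecutive samples.
        total += 0.5 * (p0 + p1) * sample_period_s
    return total


def energy_per_inference(
    power_watts: Sequence[float], sample_period_s: float, n_inferences: int
) -> float:
    """Average energy per inference over a trace covering `n_inferences` runs."""
    if n_inferences <= 0:
        raise ValueError("n_inferences must be positive")
    return energy_joules(power_watts, sample_period_s) / n_inferences
```

A trace that covers only the accelerator's supply rail, rather than the whole board, is what would let a developer attribute the resulting figure to the accelerator itself.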
Through the combination of these two components, the ElasticAI-Workflow helps developers create and deploy efficient deep learning accelerators while ensuring that their performance and energy efficiency meet application requirements.

### Key points of the solution:

- **Automated conversion**: With the ElasticAI-Creator, developers can convert deep learning models into hardware accelerators without in-depth knowledge of FPGAs.
- **Fine-grained power consumption measurement**: The Elastic Node provides detailed power consumption data, allowing developers to optimize based on actual measurement results.
- **Feedback loop**: The workflow includes multiple stages, and each stage can be optimized and adjusted according to its report until the performance and energy efficiency requirements are met.

In summary, the core goal of this paper is to improve the energy efficiency and computing power available to deep learning models on embedded devices, and the proposed ElasticAI-Workflow provides a systematic solution to these challenges.
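The feedback loop described above can be sketched as a simple search over candidate designs. This is only an illustration under stated assumptions: `Report`, `Requirements`, and `first_acceptable` are hypothetical names, not part of the ElasticAI toolchain, and the `evaluate` callback stands in for whatever stage (simulation, synthesis report, or measurement on hardware) produces the numbers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple, TypeVar

C = TypeVar("C")  # type of a candidate design (hypothetical, e.g. a bit width)


@dataclass(frozen=True)
class Report:
    """Metrics reported for one candidate (assumed units)."""
    latency_ms: float
    energy_mj: float


@dataclass(frozen=True)
class Requirements:
    """Application limits that a candidate must stay within."""
    max_latency_ms: float
    max_energy_mj: float

    def met_by(self, report: Report) -> bool:
        return (
            report.latency_ms <= self.max_latency_ms
            and report.energy_mj <= self.max_energy_mj
        )


def first_acceptable(
    candidates: Iterable[C],
    evaluate: Callable[[C], Report],
    requirements: Requirements,
) -> Optional[Tuple[C, Report]]:
    """Evaluate candidates in order; return the first that meets the limits.

    Returns None if no candidate satisfies the requirements, which signals
    that the design space or the requirements need to be revisited.
    """
    for candidate in candidates:
        report = evaluate(candidate)
        if requirements.met_by(report):
            return candidate, report
    return None
```

In practice the candidate order would encode the developer's preference, for example trying the most accurate configuration first and relaxing it only when the reported latency or energy exceeds the limits.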

ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive Computing
