Abstract: Offloading compute-intensive nested loops to FPGA accelerators has been demonstrated by numerous researchers as an effective performance enhancement technique across numerous application domains. To construct such accelerators with high design productivity, researchers have increasingly turned to overlay architectures as an intermediate generation target built on top of off-the-shelf FPGAs. However, achieving the desired performance-overhead trade-off remains a major productivity challenge, as complex application-specific customizations over a large design space covering multiple architectural parameters are needed. In this work, an automatic nested loop acceleration framework utilizing a regular soft coarse-grained reconfigurable array (SCGRA) overlay is presented. Given high-level resource constraints, the framework automatically customizes the overlay architectural design parameters, the high-level compilation options, and the communication between the accelerator and the host processor to optimize performance for the given application. In our experiments, at a cost of 10 to 20 minutes of additional tool run time, the proposed customization process resulted in up to 5 times additional speedup over a baseline accelerator generated by the same framework without customization. Overall, when compared to the equivalent software running on the host ARM processor alone on the Zedboard, the resulting accelerators achieved up to 10 times speedup.

What problem does this paper attempt to address?

This paper aims to improve both the design productivity and the performance of FPGA accelerators for compute-intensive nested loops. Specifically, the authors focus on automating the customization of accelerator design parameters by building on a soft coarse-grained reconfigurable array (SCGRA) overlay, so that each accelerator is optimized for its application. To this end, they propose an automatic nested-loop acceleration framework. Given high-level resource constraints, the framework automatically adjusts the overlay architecture design parameters, the high-level compilation options, and the communication between the accelerator and the host processor, improving accelerator performance without sacrificing design productivity.

The main contributions of the paper are as follows:

1. **Propose an automatic nested-loop acceleration framework based on a soft coarse-grained reconfigurable array (SCGRA) overlay**, which can complete the customization of the accelerator within 10 to 20 minutes, about 100 times faster than the exhaustive search method.
2. **Achieve an additional speedup of up to 5 times through automatic customization**, relative to a baseline accelerator generated by the same framework without customization.
3. **Achieve an overall speedup of up to 10 times over the equivalent software running on the host ARM processor alone**, which indicates that the framework can improve accelerator performance while also improving design productivity.

The paper describes the working process of the framework in detail, including the accelerator generation path and the accelerator customization path, and verifies its effectiveness and efficiency through experiments. The experimental results show the performance advantage of customized accelerators over non-customized ones and suggest that the framework applies flexibly across the benchmarks tested.
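To make the idea of customizing under high-level resource constraints more concrete, the following is a minimal sketch and not the paper's actual algorithm or cost model. It assumes a hypothetical configuration with three parameters (`rows`, `cols`, `unroll`), a toy LUT model, and a toy cycle model in which each accelerator invocation pays a fixed communication overhead. It searches this small space exhaustively, which is exactly the approach the paper reports improving on; all names and constants are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class OverlayConfig:
    rows: int    # hypothetical: SCGRA array height
    cols: int    # hypothetical: SCGRA array width
    unroll: int  # hypothetical: loop unrolling factor


def estimated_luts(cfg):
    # Toy resource model (assumption): a fixed LUT cost per processing element.
    return 500 * cfg.rows * cfg.cols


def estimated_cycles(cfg, iterations=4096, ops_per_iter=32, invoke_overhead=2000):
    # Toy performance model (assumption): the unrolled loop body is spread
    # evenly over the processing elements, and every accelerator invocation
    # pays a fixed host-accelerator communication overhead.
    pes = cfg.rows * cfg.cols
    invocations = -(-iterations // cfg.unroll)                    # ceiling division
    compute_per_invocation = -(-(ops_per_iter * cfg.unroll) // pes)
    return invocations * (compute_per_invocation + invoke_overhead)


def customize(lut_budget):
    # Enumerate every candidate, keep those that fit the resource budget,
    # and return the one with the lowest estimated cycle count.
    candidates = (
        OverlayConfig(r, c, u)
        for r, c, u in product([2, 3, 4, 5], [2, 3, 4, 5], [1, 4, 16, 64])
    )
    feasible = [cfg for cfg in candidates if estimated_luts(cfg) <= lut_budget]
    if not feasible:
        raise ValueError("no configuration fits the resource budget")
    return min(feasible, key=estimated_cycles)


if __name__ == "__main__":
    baseline = OverlayConfig(rows=2, cols=2, unroll=1)
    best = customize(lut_budget=8000)
    print(best, estimated_cycles(baseline) / estimated_cycles(best))
```

In this toy model the fixed per-invocation overhead dominates at small unrolling factors, which is one way to see why communication and compilation options would need to be tuned together with the array size rather than separately.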

Automatic Nested Loop Acceleration on FPGAs Using Soft CGRA Overlay

Automatic generation of efficient accelerators for reconfigurable hardware

OverGen: Improving FPGA Usability Through Domain-specific Overlay Generation

A Soft Processor Overlay with Tightly-coupled FPGA Accelerator

DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration

A Dynamic Overlay Supporting Just-In-Time Assembly to Construct Customized Hardware Accelerators

A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

FDRA: A Framework for a Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

Improving the computational efficiency and flexibility of FPGA-based CNN accelerator through loop optimization

Exploiting Outer Loop Parallelism of Nested Loop on Coarse-Grained Reconfigurable Architectures

DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

Improving HW/SW Adaptability for Accelerating CNNs on FPGAs Through A Dynamic/Static Co-Reconfiguration Approach

Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network

Hardware Abstractions and Hardware Mechanisms to Support Multi-Task Execution on Coarse-Grained Reconfigurable Arrays

The input-aware dynamic adaptation of area and performance for reconfigurable accelerator