Automatic Nested Loop Acceleration on FPGAs Using Soft CGRA Overlay

Cheng Liu, Ho-Cheung Ng, Hayden Kwok-Hay So
DOI: https://doi.org/10.48550/arXiv.1509.00042
2015-08-27
Abstract: Offloading compute-intensive nested loops to execute on FPGA accelerators has been demonstrated by numerous researchers as an effective performance enhancement technique across numerous application domains. To construct such accelerators with high design productivity, researchers have increasingly turned to the use of overlay architectures as an intermediate generation target built on top of off-the-shelf FPGAs. However, achieving the desired performance-overhead trade-off remains a major productivity challenge, as complex application-specific customizations over a large design space covering multiple architectural parameters are needed. In this work, an automatic nested loop acceleration framework utilizing a regular soft coarse-grained reconfigurable array (SCGRA) overlay is presented. Given high-level resource constraints, the framework automatically customizes the overlay architectural design parameters, the high-level compilation options, and the communication between the accelerator and the host processor to optimize performance for the given application. In our experiments, at a cost of 10 to 20 minutes of additional tool run time, the proposed customization process resulted in up to 5 times additional speedup over a baseline accelerator generated by the same framework without customization. Overall, when compared to the equivalent software running on the host ARM processor alone on the Zedboard, the resulting accelerators achieved up to 10 times speedup.
Hardware Architecture; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the difficulty of building high-performance FPGA accelerators for compute-intensive nested loops without sacrificing design productivity. Overlay architectures make accelerators quicker to build, but reaching a good performance-overhead trade-off still requires application-specific tuning across many architectural parameters. The authors propose an automated nested-loop acceleration framework built on a soft coarse-grained reconfigurable array (SCGRA) overlay. Given high-level resource constraints, the framework automatically customizes the overlay architecture design parameters, the high-level compilation options, and the communication between the accelerator and the host processor for the given application.

The main contributions of the paper are as follows:

1. **Propose an automated nested-loop acceleration framework based on soft coarse-grained reconfigurable array (SCGRA) overlays**, which completes the customization of an accelerator within 10 to 20 minutes, about 100 times faster than an exhaustive search.
2. **Achieve up to 5 times additional speedup through automatic customization**, measured against a baseline accelerator generated by the same framework without customization.
3. **Achieve up to 10 times speedup over the equivalent software running on the host ARM processor alone**, which indicates that the framework can improve accelerator performance while also improving design productivity.

The paper describes the framework's workflow in detail, including the accelerator generation path and the accelerator customization path, and evaluates its effectiveness and efficiency experimentally. The results show the performance advantage of customized accelerators over non-customized ones and indicate that the framework applies across the different benchmarks tested.
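
To make the idea of customizing an overlay under resource constraints concrete, the following is a minimal sketch in Python. It is not the authors' algorithm: it is a plain brute-force enumeration over a tiny design space, which is the kind of exhaustive search the framework is reported to outperform. The parameter names (`rows`, `cols`, `unroll`, `buffer_words`), the `Budget` fields, the candidate values, and the cost model are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil

# Every name and number below is hypothetical; none comes from the paper.
SETUP_CYCLES = 100  # assumed fixed cost of starting one host/accelerator transfer


@dataclass(frozen=True)
class OverlayConfig:
    rows: int          # overlay array height
    cols: int          # overlay array width
    unroll: int        # loop unrolling factor (a compilation option)
    buffer_words: int  # on-chip buffer used for host communication


@dataclass(frozen=True)
class Budget:
    max_pes: int           # processing elements the FPGA can hold
    max_buffer_words: int  # on-chip memory available for buffers


def fits(cfg, budget):
    """Return True if the configuration respects the resource budget."""
    return (cfg.rows * cfg.cols <= budget.max_pes
            and cfg.buffer_words <= budget.max_buffer_words)


def estimated_cycles(cfg, iterations, ops_per_iter, words_per_iter):
    """Toy cost model (an assumption): compute cycles plus transfer cycles."""
    # Parallelism is limited by both the array size and the unrolling factor.
    parallelism = min(cfg.rows * cfg.cols, cfg.unroll)
    compute = ceil(iterations * ops_per_iter / parallelism)
    # Larger buffers need fewer transfers, each paying a fixed setup cost.
    total_words = iterations * words_per_iter
    transfers = ceil(total_words / cfg.buffer_words)
    return compute + transfers * SETUP_CYCLES + total_words


def customize(budget, iterations, ops_per_iter, words_per_iter):
    """Exhaustively pick the feasible configuration with the lowest estimate."""
    candidates = (
        OverlayConfig(r, c, u, b)
        for r, c, u, b in product((2, 3, 4), (2, 3, 4),
                                  (1, 2, 4, 8, 16), (256, 512, 1024))
    )
    feasible = [cfg for cfg in candidates if fits(cfg, budget)]
    if not feasible:
        raise ValueError("no configuration fits the resource budget")
    return min(
        feasible,
        key=lambda cfg: estimated_cycles(cfg, iterations, ops_per_iter, words_per_iter),
    )


if __name__ == "__main__":
    budget = Budget(max_pes=16, max_buffer_words=1024)
    workload = dict(iterations=1024, ops_per_iter=8, words_per_iter=2)
    baseline = OverlayConfig(rows=2, cols=2, unroll=1, buffer_words=256)
    best = customize(budget, **workload)
    speedup = estimated_cycles(baseline, **workload) / estimated_cycles(best, **workload)
    print(best, f"estimated speedup over baseline: {speedup:.1f}x")
```

Even in this toy form, the number of candidates grows multiplicatively with each added parameter, which suggests why a faster customization process than exhaustive search matters once each candidate requires real compilation and implementation runs.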