Abstract: Pipelining is an effective technique for improving loop performance by overlapping the execution of several iterations, particularly on reconfigurable platforms, which are more coarse-grained. In this paper, we use a reconfigurable platform to accelerate loop-based applications by reconstructing the pipeline structure while the application executes. Based on this concept, optimization strategies such as duplexing and splitting of function units are applied from the instruction level to the task level. First, a loop is abstracted as a weighted data flow graph (WDFG), in which nodes represent tasks and edges represent inter-task dependencies. The weights of nodes and edges indicate task execution times and communication overheads, respectively. Based on this abstraction, we propose an algorithm that automatically maps pipelined loops onto reconfigurable hardware and selects whether duplexing or splitting is more appropriate. The algorithm relies on profiling information from the WDFG, such as execution times and communication overheads. Several test cases from the EEMBC benchmark are then selected to evaluate our approach. The evaluation is carried out in two ways. First, we run software simulations to assess the effectiveness of the algorithm. Second, we implement a prototype system on a state-of-the-art FPGA board to evaluate the practicality of our approach on a reconfigurable platform. Pipeline performance indicators such as speedup, throughput, and efficiency are measured in both settings. In software simulation, the speedup and throughput rate of the optimized pipeline improve by at least 2 times and the efficiency increases by 1.1 to 1.5 times. On the hardware platform, the speedup and efficiency increase by 1.5 times, limited by communication cost and reconfiguration delay, while the throughput rate increases by 1.5 to 2 times. Experimental results demonstrate that our approach achieves satisfactory performance in terms of both effectiveness and practicality.
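
To make the quantities in the abstract concrete, the following is a minimal Python sketch of a linear pipeline whose stages carry an execution time (node weight) and a communication overhead (edge weight). It is not the paper's algorithm, which the abstract does not specify. The `Stage` class, the helper names, the textbook formulas for speedup, throughput, and efficiency, the cost parameters `dup_comm` and `split_comm`, and the rule of picking whichever option yields higher throughput are all assumptions made for illustration. Duplexing is modeled here as adding a second function unit to the bottleneck stage, and splitting as cutting that stage into two equal halves.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stage:
    name: str
    time: float        # task execution time (node weight)
    comm: float = 0.0  # overhead of sending the result onward (edge weight)
    copies: int = 1    # number of function units serving this stage

    @property
    def interval(self) -> float:
        # Average time between consecutive results leaving the stage.
        return (self.time + self.comm) / self.copies


def metrics(stages, iterations):
    """Textbook metrics for a linear pipeline (an assumed model)."""
    work = sum(s.time for s in stages)              # one iteration, no overlap
    latency = sum(s.time + s.comm for s in stages)  # time to the first result
    bottleneck = max(s.interval for s in stages)
    total = latency + (iterations - 1) * bottleneck
    speedup = iterations * work / total
    units = sum(s.copies for s in stages)
    return {"speedup": speedup,
            "throughput": iterations / total,
            "efficiency": speedup / units}


def duplicate(stages, i, dup_comm):
    # Hypothetical duplexing: one more unit, plus scatter/gather overhead.
    s = stages[i]
    dup = replace(s, comm=s.comm + dup_comm, copies=s.copies + 1)
    return stages[:i] + [dup] + stages[i + 1:]


def split(stages, i, split_comm):
    # Hypothetical splitting: two equal halves joined by a new edge.
    s = stages[i]
    first = replace(s, name=s.name + ".1", time=s.time / 2, comm=split_comm)
    second = replace(s, name=s.name + ".2", time=s.time / 2)
    return stages[:i] + [first, second] + stages[i + 1:]


def improve_bottleneck(stages, iterations, dup_comm, split_comm):
    # Apply both transformations to the slowest stage and keep the better one.
    i = max(range(len(stages)), key=lambda k: stages[k].interval)
    candidates = {"duplicate": duplicate(stages, i, dup_comm),
                  "split": split(stages, i, split_comm)}
    choice = max(candidates,
                 key=lambda c: metrics(candidates[c], iterations)["throughput"])
    return choice, candidates[choice]


loop = [Stage("A", 2.0), Stage("B", 6.0), Stage("C", 2.0)]
choice, optimized = improve_bottleneck(loop, 100, dup_comm=2.0, split_comm=0.5)
print(choice, metrics(optimized, 100))
```

In this toy model, duplication wins when its scatter/gather overhead is small and splitting wins when the new inter-stage edge is cheap. This mirrors the abstract's point that the choice depends on profiled execution times and communication overheads.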

A Fine-grained Pipelined Implementation of the LINPACK Benchmark on FPGAs

Design of Hardware Pipelining Processor

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

Parallel Sparse LU Decomposition Using FPGA with an Efficient Cache Architecture

Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems

Hexagonal Tiling Based Multiple FPGAs Stencil Computation Acceleration and Optimization Methodology

High-performance Placement Engine for Modern Large-scale FPGAs With Heterogeneity and Clock Constraints

Integrating FPGA-based hardware acceleration with relational databases

A Software Pipelining Based VLIW Architecture and Optimizing Compiler

Pipeline optimization for loops on reconfigurable platform

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA

Research of High-Speed Pipelined Floating-Point Multiplier Design

Software Pipelining for Graphic Processing Unit Acceleration: Partition, Scheduling and Granularity

Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs

Layout Driven FPGA Packing Algorithm for Performance Optimization

Improving ILP via Fused In-Order Superscalar and VLIW Instruction Dispatch Methods

Pflow: An end-to-end heterogeneous acceleration framework for CNN inference on FPGAs