Abstract: Efficient utilization of constrained memory resources is of paramount importance in CPU-FPGA heterogeneous multiprocessor system-on-chip (HMPSoC)-based system design for memory-intensive applications. State-of-the-art high-level synthesis (HLS) tools rely on system programmers to manually determine the data placement within the complex memory hierarchy. Different data placement policies may lead to different system performance, and finding an optimal data placement policy is a nontrivial problem. For instance, we show the counter-intuitive result that a traditional frequency- and locality-based data placement strategy designed for CPU architectures leads to nonoptimal system performance in CPU-FPGA HMPSoCs. In this work, we first propose an automatic data placement framework for field-programmable gate array (FPGA) kernels that determines whether each array object should be accessed via the on-chip BRAM, the shared CPU L2 cache, or DDR memory to achieve optimal performance. Moreover, we find that when the CPU kernel and the FPGA kernel execute in parallel, memory contention may degrade performance, so the optimal data placement policy designed for the FPGA kernel alone does not achieve optimal overall system performance. We therefore propose cache partitioning to alleviate the impact of memory contention. We extend the framework designed for the FPGA by adding cross-layer memory contention analysis to automatically generate an optimal data placement policy and cache partitioning mechanism for the kernels executing in parallel. The proposed data placement framework can be seamlessly integrated with the commercial Vivado HLS tool. Experimental results on the Zedboard platform show an average performance speedup for FPGA kernels compared with a greedy-based allocation strategy. When FPGA kernels and CPU kernels are executed in parallel, both the FPGA kernel and the CPU kernel achieve a performance speedup on average.
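The data placement decision described in the abstract can be pictured with a toy model. The sketch below is not the framework proposed in the paper: the latencies, the BRAM budget, the array sizes and access counts, and all function names are hypothetical values chosen for illustration. It only shows why a greedy, frequency-ordered allocation can lose to a search over all placements when on-chip capacity is limited.

```python
from itertools import product

# Hypothetical per-access latencies in cycles (illustrative, not measured).
LATENCY = {"BRAM": 1, "L2": 12, "DDR": 40}
BRAM_CAPACITY_KB = 64  # hypothetical on-chip BRAM budget

# Made-up workload: array name -> (size in KB, number of accesses).
ARRAYS = {"a": (48, 1000), "b": (32, 900), "c": (32, 800)}


def cost(placement):
    """Estimated total access cycles under the toy latency model."""
    return sum(ARRAYS[name][1] * LATENCY[mem] for name, mem in placement.items())


def feasible(placement):
    """A placement is valid if the arrays mapped to BRAM fit in the budget."""
    used = sum(ARRAYS[name][0] for name, mem in placement.items() if mem == "BRAM")
    return used <= BRAM_CAPACITY_KB


def greedy_by_frequency():
    """Place the most frequently accessed arrays in BRAM first, the rest in L2."""
    placement, free = {}, BRAM_CAPACITY_KB
    for name in sorted(ARRAYS, key=lambda n: -ARRAYS[n][1]):
        size = ARRAYS[name][0]
        if size <= free:
            placement[name] = "BRAM"
            free -= size
        else:
            placement[name] = "L2"
    return placement


def exhaustive():
    """Try every assignment of arrays to memories and keep the cheapest valid one."""
    names = list(ARRAYS)
    best = None
    for choice in product(LATENCY, repeat=len(names)):
        candidate = dict(zip(names, choice))
        if feasible(candidate) and (best is None or cost(candidate) < cost(best)):
            best = candidate
    return best


if __name__ == "__main__":
    for label, placement in (("greedy", greedy_by_frequency()), ("exhaustive", exhaustive())):
        print(label, placement, cost(placement))
```

In this made-up workload the greedy pass gives the BRAM to the hottest array, which then crowds out two slightly cooler arrays that would fit together. The model ignores contention between CPU and FPGA kernels and cache partitioning, both of which the abstract identifies as important in practice.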

Parallel Sparse LU Decomposition Using FPGA with an Efficient Cache Architecture

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Sparse matrix LU decomposition method based on GPU

Sparse LU Factorization for Parallel Circuit Simulation on GPU

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations

Efficient Memory Partitioning for Parallel Data Access in FPGA via Data Reuse

GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

Parallel Sparse Left-Looking Algorithm

An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation

A Novel Fully Hardware-Implemented SVD Solver Based on Ultra-Parallel BCV Jacobi Algorithm

FPGA-Accelerated Compactions for LSM-based Key-Value Store

NUMA-aware parallel sparse LU factorization for SPICE-based circuit simulators on ARM multi-core processors

A Comprehensive Memory Management Framework for CPU-FPGA Heterogenous SoCs

Nonzero Pattern Analysis and Memory Access Optimization in GPU-based Sparse LU Factorization for Circuit Simulation

A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Efficient Memory Partitioning For Parallel Data Access Via Data Reuse

An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs