Abstract: Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse, and the ability to explore different implementation options. However, although the HLS process itself is fast, implementation and performance analysis still require lengthy logic synthesis and physical design. To optimize a design, HLS tools must explore the design space to extract parallelism at multiple levels of granularity, including parallelism within a single HLS-generated core and parallelism across multiple core instances. Core interconnect and external bandwidth limitations can significantly restrict the feasible options in the design space. With many dimensions to explore, it quickly becomes infeasible to perform full logic synthesis and physical design for each possible design point. However, generating and evaluating the communication infrastructure as part of the exploration is critical to determining system performance. Thus, in this paper, we extend the prior multilevel granularity parallelism exploration in the FCUDA HLS flow, which takes CUDA code as design input and generates a corresponding field-programmable gate array (FPGA) implementation. Our framework performs an initial characterization of the application design space, then analytically explores the design space considering parallelism, core interconnect, and external memory bandwidth, and selects a Pareto-optimal set of designs. Our flow is completely automated: it characterizes the analytical model, performs the exploration, selects a solution, and integrates multiple instantiations of FCUDA cores via an Advanced eXtensible Interface (AXI) bus interconnect. Our results demonstrate that this new FCUDA flow efficiently identifies and generates implementations with up to 5× improved system performance compared to single-level granularity parallelism (core-level optimization).
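
The selection step described in the abstract, keeping only the non-dominated designs that satisfy an external bandwidth limit, can be sketched as follows. This is a minimal illustration and not the FCUDA implementation: the `DesignPoint` fields, the choice of latency and area as the two objectives, and the `bandwidth_limit` parameter are all assumptions made for the example. In the actual flow the estimates would come from the analytical model rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    # All fields are hypothetical stand-ins for analytical-model estimates.
    unroll: int       # core-level parallelism knob (assumed)
    cores: int        # number of core instances (assumed)
    latency: float    # estimated latency, arbitrary units, lower is better
    area: float       # estimated resource usage, arbitrary units, lower is better
    bandwidth: float  # estimated external memory bandwidth demand


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a.latency <= b.latency and a.area <= b.area
    strictly_better = a.latency < b.latency or a.area < b.area
    return no_worse and strictly_better


def pareto_front(points, bandwidth_limit):
    """Drop points that exceed the bandwidth limit, then keep the non-dominated ones."""
    feasible = [p for p in points if p.bandwidth <= bandwidth_limit]
    return [p for p in feasible if not any(dominates(q, p) for q in feasible)]


# Toy estimates, invented purely for illustration.
points = [
    DesignPoint(unroll=1, cores=1, latency=100.0, area=10.0, bandwidth=1.0),
    DesignPoint(unroll=2, cores=1, latency=60.0, area=18.0, bandwidth=1.8),
    DesignPoint(unroll=1, cores=4, latency=30.0, area=40.0, bandwidth=3.5),
    DesignPoint(unroll=2, cores=4, latency=35.0, area=45.0, bandwidth=3.9),
    DesignPoint(unroll=4, cores=8, latency=10.0, area=90.0, bandwidth=9.0),
]
front = pareto_front(points, bandwidth_limit=4.0)
```

With these toy numbers, the fastest point is excluded because its bandwidth demand exceeds the limit, and the `unroll=2, cores=4` point is dropped because the `unroll=1, cores=4` point is both faster and smaller. Evaluating candidates this way only requires model estimates, which is why an analytical exploration can avoid running logic synthesis and physical design for every design point.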
