Abstract: Despite the increasing adoption of Field-Programmable Gate Arrays (FPGAs) in compute clouds, there remains a significant gap in programming tools and abstractions that can leverage network-connected, cloud-scale, multi-die FPGAs to generate accelerators with high frequency and throughput. To this end, we propose TAPA-CS, a task-parallel dataflow programming framework that automatically partitions and compiles a large design across a cluster of FPGAs with no additional user effort, while achieving high frequency and throughput. TAPA-CS makes three main contributions. First, it is an open-source framework that lets users leverage virtually "unlimited" accelerator fabric, high-bandwidth memory (HBM), and on-chip memory by abstracting away the underlying hardware. This reduces the user's programming burden to a purely logical one, enabling software developers and researchers with limited FPGA domain knowledge to deploy larger designs than was previously possible. Second, given a large design as input, TAPA-CS automatically partitions it across multiple FPGAs while ensuring congestion control, resource balancing, and overlap of communication and computation. Third, TAPA-CS couples coarse-grained floorplanning with automated interconnect pipelining at the inter- and intra-FPGA levels to ensure high frequency. We have tested TAPA-CS on our multi-FPGA testbed, where the FPGAs communicate over high-speed 100 Gbps Ethernet. We have evaluated the performance and scalability of our tool on designs including systolic-array-based convolutional neural networks (CNNs), graph processing workloads such as PageRank, stencil applications such as the Dilate kernel, and K-nearest neighbors (KNN). TAPA-CS has the potential to accelerate the development of increasingly large and complex designs on low-power, reconfigurable FPGAs.

PAAS: A system level simulator for heterogeneous computing architectures

An accelerator-aware microarchitecture simulator for design space exploration

A novel cross-layer framework for early-stage power delivery and architecture co-exploration

A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms

Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-Acceleration

Analyzing Parallelization and Program Performance in Heterogeneous MPSoCs

Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors

Performance Modelling of Parallel and Distributed Computing Using PACE

Design Space Exploration of HW Accelerators and Network Infrastructure for FPGA-Based MPSoC

CAMAS: Static and Dynamic Hybrid Cache Management for CPU-FPGA Platforms

gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips

Kernel-as-a-Service: A Serverless Programming Model for Heterogeneous Hardware Accelerators

Design Space Exploration of FPGA-based Accelerators with Multi-Level Parallelism

Modelling of ASCI High Performance Applications Using PACE

CAMASim: A Comprehensive Simulation Framework for Content-Addressable Memory based Accelerators

HosNa: A DPC++ Benchmark Suite for Heterogeneous Architectures

Design and Application Space Exploration of a Domain-Specific Accelerator System

Performance modeling of parallel and distributed computing using PACE

High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip

TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs

Performance monitoring for multicore embedded computing systems on FPGAs