POPA: Expressing High and Portable Performance Across Spatial and Vector Architectures for Tensor Computations

Xiaochen Hao,Hongbo Rong,Mingzhe Zhang,Ce Sun,Hong H. Jiang,Yun Liang
DOI: https://doi.org/10.1145/3626202.3637566
2024-01-01
Abstract:This paper aims at high and portable performance for tensor computations across spatial (e.g., FPGAs) and vector architectures (e.g., GPUs). The state-of-the-art usually address performance portability across vector architectures (CPUs and GPUs). However, they either miss FPGAs or do not achieve high performance. Without a common architectural abstraction, they program and optimize spatial and vector devices separately, causing low portability. We propose a unified programming framework, POPA, which achieves portability via architectural abstraction and performance via specialization. A parallel dataflow machine is proposed as a unified, abstract hardware target that hides differences of concrete architectures. The machine consists of software-defined systolic arrays and a tensor-specific cache hierarchy, which captures pipeline parallelism and customizable memories on FPGAs, as well as multithreading parallelism on GPUs. The machine is specified in a unified programming model as two dataflow graphs for scheduling compute and data movement, respectively. A compiler then specializes the abstract machine to exploit the properties of FPGAs and GPUs, bridging the gap between the abstract machine and a concrete architecture. We evaluate POPA on several Intel FPGAs and GPUs with high-profile tensor kernels, and this is the first system that achieves >=80% performance of expert-written code or machine peak across architectures, to the best of our knowledge.
What problem does this paper attempt to address?