T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, David Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff Lowney, Adam Herr, Christopher Hughes, Timothy Mattson, Pradeep Dubey
DOI: https://doi.org/10.1109/fccm.2019.00033
2019-01-01
Abstract: We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92%, respectively, of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels (MTTKRP, TTM, and TTMc) were also implemented with high performance and low design effort, and for the first time on spatial architectures.
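As a rough illustration of the idea of separating what is computed from how the loops are organized, the following plain-Python sketch writes dense matrix multiply once as a reference loop nest and once with a blocked (tiled) loop structure that produces the same result, and adds a reference loop nest for MTTKRP using its usual textbook definition. This is not the framework's language or compiler output; the function names, the `tile` parameter, and the use of tiling as a stand-in for a spatial mapping are assumptions made here for illustration only.

```python
from itertools import product


def matmul(A, B):
    """Reference specification: C[i][j] = sum_k A[i][k] * B[k][j]."""
    I, K, J = len(A), len(B), len(B[0])
    C = [[0] * J for _ in range(I)]
    for i, j, k in product(range(I), range(J), range(K)):
        C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same function with a blocked loop structure.

    Hypothetical stand-in for a mapping choice: the arithmetic is
    unchanged, only the iteration order over (i, j, k) differs.
    """
    I, K, J = len(A), len(B), len(B[0])
    C = [[0] * J for _ in range(I)]
    blocks = product(range(0, I, tile), range(0, J, tile), range(0, K, tile))
    for i0, j0, k0 in blocks:
        for i in range(i0, min(i0 + tile, I)):
            for j in range(j0, min(j0 + tile, J)):
                for k in range(k0, min(k0 + tile, K)):
                    C[i][j] += A[i][k] * B[k][j]
    return C


def mttkrp(A, B, C):
    """Reference MTTKRP: D[i][j] = sum_{k,l} A[i][k][l] * B[k][j] * C[l][j]."""
    I, K, L, J = len(A), len(B), len(C), len(B[0])
    D = [[0] * J for _ in range(I)]
    for i, j, k, l in product(range(I), range(J), range(K), range(L)):
        D[i][j] += A[i][k][l] * B[k][j] * C[l][j]
    return D


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[1, 0], [0, 1], [1, 1]]
    # Both loop structures compute the same matrix product.
    print(matmul(A, B) == matmul_tiled(A, B))
```

Checking that the two versions agree on every input is the software analogue of what the abstract describes: the functional specification stays fixed while the organization of the computation is varied.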