T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations
Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, David Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff Lowney, Adam Herr, Christopher Hughes, Timothy Mattson, Pradeep Dubey
DOI: https://doi.org/10.1109/fccm.2019.00033
2019-01-01
Abstract: We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels, MTTKRP, TTM and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.
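The decoupling described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the T2S-Tensor language or its compiler output; the function names `matmul_spec` and `matmul_tiled` and the `tile` parameter are hypothetical, chosen only for this illustration. The first function states what is computed, and the second reorders the same computation into blocks, which is loosely analogous to choosing a different mapping while leaving the function unchanged.

```python
from itertools import product


def matmul_spec(a, b):
    """Functional specification: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same function with a different iteration order (hypothetical 'mapping').

    The loops walk over tile x tile x tile blocks of the iteration space.
    Only the order of the multiply-accumulate operations changes; the
    result is identical to matmul_spec for integer inputs.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops pick a block; inner loops do the work inside that block.
    for i0, j0, k0 in product(range(0, n, tile),
                              range(0, p, tile),
                              range(0, m, tile)):
        for i in range(i0, min(i0 + tile, n)):
            for j in range(j0, min(j0 + tile, p)):
                for k in range(k0, min(k0 + tile, m)):
                    c[i][j] += a[i][k] * b[k][j]
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 2]]

# Changing the tile size changes the schedule, not the answer.
assert matmul_spec(A, B) == matmul_tiled(A, B, tile=2)
```

In this sketch the programmer would vary only `tile` (or the loop order) to explore alternatives, while `matmul_spec` stays fixed as the reference definition of the kernel.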