FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning
Kai Zhong, Zhenhua Zhu, Guohao Dai, Hongyi Wang, Xinhao Yang, Haoyu Zhang, Jin Si, Qiuli Mao, Shulin Zeng, Ke Hong, Genghan Zhang, Huazhong Yang, Yu Wang
DOI: https://doi.org/10.1145/3620666.3651336
2024-01-01
Abstract: Sparse tensor algebra (SpTA) plays an increasingly important role in machine learning. However, because SpTA sparsity is unstructured, general-purpose processors (e.g., GPUs and CPUs) underutilize their hardware resources and run these workloads inefficiently. Sparse kernel accelerators are optimized for specific tasks, but their dedicated processing units and data paths cannot effectively support other SpTA tasks with different dataflows and varying sparsity, which degrades performance. This paper proposes FEASTA, a Flexible and Efficient Accelerator for Sparse Tensor Algebra. To process general SpTA tasks with varying sparsity efficiently, we design FEASTA at three levels. At the dataflow abstraction level, we apply Einstein summation to the sparse fiber tree data structure and model the unified execution flow of general SpTA as joining and merging fiber trees. At the instruction set architecture (ISA) level, we propose a general SpTA ISA based on this execution flow. It includes different types of instructions for dense and sparse data, achieving flexibility and efficiency at the instruction level. At the architecture level, we design an instruction-driven architecture consisting of configurable, high-performance function units that supports the flexible and efficient ISA. Evaluations show that FEASTA achieves a 5.40× geomean energy efficiency improvement over a GPU across various workloads. On sparse matrix multiplication kernels, FEASTA delivers 1.47× and 3.19× higher performance than a state-of-the-art sparse matrix accelerator and a CPU extension, respectively. Across diverse kernels, FEASTA achieves 1.69-12.70× higher energy efficiency than existing architectures.
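The abstract's dataflow abstraction, in which an Einstein summation over sparse fiber trees reduces to joining and merging fibers, can be illustrated with a small software sketch. The following is not FEASTA's ISA or hardware flow. It is a minimal Python model under stated assumptions: a fiber is a dict mapping a coordinate to a payload (a scalar or a child fiber), "join" is taken to mean coordinate intersection, and "merge" is taken to mean coordinate union with accumulation. The names `join`, `merge`, `spmv`, and `spgemm` are hypothetical and chosen only for illustration.

```python
# Assumption: a fiber is a dict {coordinate: payload}; a payload is either a
# scalar (leaf) or another fiber (child level of the fiber tree).

def join(fiber_a, fiber_b):
    """Intersect two fibers: yield (coord, payload_a, payload_b) for every
    coordinate present in both. Models the multiply side of an einsum."""
    for coord in sorted(fiber_a.keys() & fiber_b.keys()):
        yield coord, fiber_a[coord], fiber_b[coord]

def merge(fiber_a, fiber_b):
    """Union two leaf fibers, adding payloads where coordinates coincide.
    Models the accumulate side of an einsum."""
    out = dict(fiber_a)
    for coord, val in fiber_b.items():
        out[coord] = out.get(coord, 0) + val
    return out

def spmv(A, x):
    """y_i = sum_j A_ij * x_j, with A as a two-level fiber tree (i -> j)."""
    y = {}
    for i, row in A.items():
        acc = sum(a * b for _, a, b in join(row, x))  # join on j
        if acc != 0:
            y[i] = acc
    return y

def spgemm(A, B):
    """C_ij = sum_k A_ik * B_kj, computed row by row.
    A is a fiber tree i -> k, B is a fiber tree k -> j."""
    C = {}
    for i, row_a in A.items():
        acc = {}
        for _, a_ik, row_b in join(row_a, B):         # join on k
            scaled = {j: a_ik * b for j, b in row_b.items()}
            acc = merge(acc, scaled)                  # merge on j
        if acc:
            C[i] = acc
    return C

# Small example: only nonzero entries are stored.
A = {0: {0: 1.0, 2: 2.0}, 2: {1: 3.0}}
x = {0: 4.0, 1: 5.0}
B = {0: {1: 1.0}, 2: {0: 2.0, 1: 3.0}}
y = spmv(A, x)
C = spgemm(A, B)
```

In this sketch both kernels are built from the same two primitives, which is the sense in which a join/merge view can unify different SpTA dataflows; how FEASTA maps such operations to instructions and function units is described in the paper itself.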