Abstract: Sparse Triangular Solve (SpTRSV) has long been an essential kernel in scientific computing. Because of its low computational intensity and internal data dependencies, SpTRSV is hard to implement and optimize on GPUs. Our experimental observations show that existing GPU implementations fall short of optimal performance because of sub-optimal parallelism setups and code implementations, and because they do not account for irregular data distribution. Moreover, their algorithm designs do not adapt to different input matrices, so keeping performance consistent may require substantial manual effort in algorithm redesign and parameter tuning. In this work, we propose AG-SpTRSV, an automatic framework to optimize SpTRSV on GPUs, which provides high performance on various matrices while eliminating the costs of manual design. AG-SpTRSV abstracts the procedure of optimizing an SpTRSV kernel as a scheme and constructs a comprehensive optimization space based on it. By defining a unified code template and preparing code variants, AG-SpTRSV enables fine-grained dynamic parallelism and adaptive code optimizations to handle various tasks. Through computation graph transformation and multi-hierarchy heuristic scheduling, AG-SpTRSV generates schemes for task partitioning and mapping, which effectively address the issues of irregular data distribution and internal data dependencies. AG-SpTRSV searches for the best scheme to optimize the target kernel for the specific matrix. A learned lightweight performance model is also introduced to reduce search costs. Experimental results with the SuiteSparse Matrix Collection on NVIDIA Tesla A100 and RTX 3080 Ti show that AG-SpTRSV outperforms state-of-the-art implementations with geometric average speedups of 2.12x to 3.99x. With the performance model enabled, AG-SpTRSV provides an efficient end-to-end solution, with preprocessing times ranging from 3.4 to 245 times the execution time.
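
To make the "internal data dependencies" mentioned above concrete, here is a minimal sequential sketch of SpTRSV in Python. It is not AG-SpTRSV or any GPU kernel. It assumes a lower-triangular matrix stored in CSR format with an explicit nonzero diagonal. The function names `sptrsv_csr_lower` and `dependency_levels` are hypothetical and chosen only for this illustration. The second function computes, for each row, the depth of its dependency chain; rows with the same level have no dependencies on one another, so in principle they could be solved in parallel.

```python
def sptrsv_csr_lower(indptr, indices, data, b):
    """Solve L x = b by forward substitution (L lower triangular, CSR)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                acc -= data[k] * x[j]  # row i needs x[j] from an earlier row
            elif j == i:
                diag = data[k]
        if diag is None or diag == 0:
            raise ValueError(f"row {i} has no nonzero diagonal entry")
        x[i] = acc / diag
    return x


def dependency_levels(indptr, indices):
    """Level of row i = 1 + max level of the rows it depends on (0 if none)."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Illustrative 4x4 matrix (made up for this sketch):
# [[2, 0, 0, 0],
#  [1, 1, 0, 0],
#  [0, 0, 4, 0],
#  [0, 3, 1, 5]]
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 5.0]
b = [2.0, 3.0, 8.0, 13.0]

x = sptrsv_csr_lower(indptr, indices, data, b)
levels = dependency_levels(indptr, indices)
print(x)       # [1.0, 2.0, 2.0, 1.0]
print(levels)  # [0, 1, 0, 2]
```

In this example rows 0 and 2 are independent (level 0), row 1 must wait for row 0, and row 3 must wait for rows 1 and 2. How many rows share a level, and how many nonzeros each row holds, vary widely from matrix to matrix.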
