Abstract: Sparse triangular solve (SpTRSV) is a key operation in preconditioners. In scientific computing programs that iteratively solve systems of partial differential equations, structured SpTRSV is a common problem type and often a performance bottleneck. Commercial mathematical libraries for the graphics processing unit (GPU) platform, represented by cuSPARSE, parallelize SpTRSV with level-scheduling methods. However, when applied to structured SpTRSV problems, this approach suffers from time-consuming preprocessing and severe GPU thread idling. This study proposes a parallel algorithm tailored to structured SpTRSV problems. During task allocation, the proposed algorithm exploits the special non-zero distribution pattern of structured SpTRSV problems, which lets it skip the preprocessing and analysis of the non-zero structure of the input problem. Furthermore, the element-wise operation strategy used in existing level-scheduling methods is modified. As a result, GPU thread idling is effectively alleviated, and the memory access latency of some non-zero elements of the matrix is hidden. This study also adopts a state variable compression technique suited to the task allocation characteristics of the proposed algorithm, which significantly improves the cache hit rate of state variable operations. Additionally, several GPU hardware features, including predicated execution, are investigated to comprehensively optimize the implementation of the algorithm. The proposed algorithm is tested on an NVIDIA V100 GPU, achieving an average 2.71× speedup over cuSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The modified element-wise operation strategy, combined with a series of other GPU hardware optimizations, yields an increase of nearly 115% in the effective memory-access bandwidth of the proposed algorithm.
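To make the preprocessing cost of level scheduling concrete, the following is a minimal sequential sketch in Python. It is not the algorithm proposed in the paper. It assumes a hypothetical structured problem, the lower-triangular part of a 5-point stencil on an `nx` by `ny` grid with row-major numbering, and all function names are illustrative. The sketch shows that a generic level-scheduling analysis must traverse the non-zero structure to assign each row a level, whereas for this assumed stencil the level of grid point `(i, j)` is simply `i + j` and could be known without any analysis.

```python
def build_lower_stencil(nx, ny):
    """CSR arrays for the lower-triangular part (diagonal included) of a
    5-point stencil on an nx-by-ny grid (assumed example matrix)."""
    row_ptr, col_idx, vals = [0], [], []
    for j in range(ny):
        for i in range(nx):
            r = j * nx + i
            if j > 0:                      # neighbour (i, j-1)
                col_idx.append(r - nx)
                vals.append(-1.0)
            if i > 0:                      # neighbour (i-1, j)
                col_idx.append(r - 1)
                vals.append(-1.0)
            col_idx.append(r)              # diagonal entry
            vals.append(4.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def analyze_levels(row_ptr, col_idx):
    """Generic level-scheduling preprocessing: the level of a row is one more
    than the largest level among the rows it depends on."""
    n = len(row_ptr) - 1
    level = [0] * n
    for r in range(n):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            c = col_idx[k]
            if c < r:
                level[r] = max(level[r], level[c] + 1)
    return level


def solve_by_levels(row_ptr, col_idx, vals, b, level):
    """Solve L x = b level by level. Rows inside one level are mutually
    independent, so a GPU implementation could process them in parallel."""
    x = [0.0] * len(b)
    buckets = [[] for _ in range(max(level) + 1)]
    for r, lv in enumerate(level):
        buckets[lv].append(r)
    for rows in buckets:
        for r in rows:
            acc, diag = b[r], None
            for k in range(row_ptr[r], row_ptr[r + 1]):
                c = col_idx[k]
                if c == r:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[c]
            x[r] = acc / diag
    return x


nx, ny = 4, 3
row_ptr, col_idx, vals = build_lower_stencil(nx, ny)

# Levels found by traversing the sparsity structure (the preprocessing step).
analyzed = analyze_levels(row_ptr, col_idx)
# Levels written down directly from the grid structure, with no analysis.
closed_form = [i + j for j in range(ny) for i in range(nx)]

# Right-hand side chosen so that the exact solution is a vector of ones.
b = [sum(vals[row_ptr[r]:row_ptr[r + 1]]) for r in range(nx * ny)]
x = solve_by_levels(row_ptr, col_idx, vals, b, closed_form)
```

In this sketch the two level assignments coincide, which illustrates why a solver that knows the structure in advance can skip the analysis phase. The first and last levels of the assumed grid contain only one row each, which also hints at how a level-by-level schedule can leave GPU threads idle.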

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Atomic Reduction Based Sparse Matrix-Transpose Vector Multiplication on GPUs

Compilation of Modular and General Sparse Workspaces

SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning

Optimizing sparse matrix-vector multiplication based on gpu

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore Cpu

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Optimizing sparse general matrix–matrix multiplication for DCUs

Efficient Algorithm Design of Optimizing SpMV on GPU.

SpDISTAL: Compiling Distributed Sparse Tensor Computations

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs

fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms

Parallel Structured Sparse Triangular Solver for GPU Platform

Sparse-HeteroCL: from Sparse Tensor Algebra to Highly Customized Accelerators on FPGAs.

TSCompiler: Efficient Compilation Framework for Dynamic-Shape Models