Abstract: Sparse triangular solve (SpTRSV) is a vital operation in preconditioners. In scientific computing programs that iteratively solve systems of partial differential equations, structured SpTRSV is a common problem type and often a performance bottleneck. Commercial mathematical libraries for the graphics processing unit (GPU) platform, represented by cuSPARSE, parallelize SpTRSV with level-scheduling methods. When applied to structured SpTRSV problems, however, level scheduling suffers from time-consuming preprocessing and severe GPU thread idling. This study proposes a parallel algorithm tailored to structured SpTRSV problems. During task allocation, the proposed algorithm exploits the special distribution pattern of non-zero elements in structured SpTRSV problems, which lets it skip preprocessing and analysis of the non-zero structure of the input problem. Furthermore, the element-wise operation strategy used in existing level-scheduling methods is modified. As a result, GPU thread idling is effectively alleviated and the memory access latency of some non-zero elements in the matrix is hidden. Based on the task allocation characteristics of the proposed algorithm, this study also adopts a state variable compression technique, which significantly improves the cache hit rate of state variable operations. Additionally, several GPU hardware features, including predicated execution, are investigated to comprehensively optimize the implementation. The proposed algorithm is tested on an NVIDIA V100 GPU, where it achieves an average 2.71× speedup over cuSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The modified element-wise operation strategy, combined with a series of other GPU hardware optimizations, yields a nearly 115% increase in the effective memory-access bandwidth of the proposed algorithm.
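To make the level-scheduling idea concrete, the following is a minimal serial sketch in Python, not the paper's GPU kernel. It assumes a lower-triangular matrix in CSR form with the diagonal stored in every row. The helper names (`compute_levels`, `solve_by_levels`, `stencil_lower_csr`) and the 5-point stencil test matrix are illustrative assumptions that do not appear in the abstract. The sketch shows the preprocessing pass that level scheduling needs for a general matrix. It also shows that, for one simple structured case, the level of a row follows directly from its grid indices. This is one plausible reading of how a structured non-zero pattern can make that analysis unnecessary.

```python
def compute_levels(row_ptr, col_idx):
    """Preprocessing pass: level of row r is 1 + the highest level among
    the rows it depends on (its strictly-lower non-zero columns)."""
    n = len(row_ptr) - 1
    level = [0] * n
    for r in range(n):
        for p in range(row_ptr[r], row_ptr[r + 1]):
            c = col_idx[p]
            if c < r:
                level[r] = max(level[r], level[c] + 1)
    return level


def solve_by_levels(row_ptr, col_idx, vals, b):
    """Solve L x = b level by level. Rows inside one level do not depend
    on each other, so a GPU could process them in parallel; here they
    are simply looped over."""
    n = len(b)
    level = compute_levels(row_ptr, col_idx)
    buckets = [[] for _ in range(max(level) + 1)]
    for r, lv in enumerate(level):
        buckets[lv].append(r)
    x = [0.0] * n
    for rows in buckets:              # levels execute one after another
        for r in rows:                # independent rows within a level
            acc, diag = b[r], None
            for p in range(row_ptr[r], row_ptr[r + 1]):
                c = col_idx[p]
                if c == r:
                    diag = vals[p]
                else:
                    acc -= vals[p] * x[c]
            x[r] = acc / diag
    return x


def stencil_lower_csr(nx, ny):
    """Hypothetical structured test case: lower-triangular part of a
    5-point stencil on an nx-by-ny grid, with row index r = j * nx + i."""
    row_ptr, col_idx, vals = [0], [], []
    for j in range(ny):
        for i in range(nx):
            r = j * nx + i
            if j > 0:                 # dependency on grid point (i, j-1)
                col_idx.append(r - nx)
                vals.append(-1.0)
            if i > 0:                 # dependency on grid point (i-1, j)
                col_idx.append(r - 1)
                vals.append(-1.0)
            col_idx.append(r)         # diagonal entry
            vals.append(4.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals
```

For this stencil, `compute_levels` returns `i + j` for grid point `(i, j)`, so the schedule could be derived from the indices alone without scanning the matrix. The number of rows per level also grows and then shrinks along the grid diagonals, which hints at why threads can sit idle when each level is processed as a separate step.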

Parallel Structured Sparse Triangular Solver for GPU Platform
