Abstract: Sparse triangular solve (SpTRSV) is a vital operation in preconditioners. In scientific computing programs that iteratively solve systems of partial differential equations, structured SpTRSV is a common problem and often a performance bottleneck. Commercial mathematical libraries for the graphics processing unit (GPU) platform, represented by cuSPARSE, parallelize SpTRSV by level scheduling. When applied to structured SpTRSV problems, however, this method suffers from time-consuming preprocessing and severe GPU thread idling. This study proposes a parallel algorithm tailored to structured SpTRSV problems. During task allocation, the proposed algorithm exploits the special distribution pattern of non-zero elements in structured SpTRSV problems, which lets it skip the preprocessing and analysis of the non-zero structure of the input. Furthermore, the element-wise operation strategy used in existing level-scheduling methods is modified, which effectively alleviates GPU thread idling and hides the memory access latency of some non-zero elements in the matrix. Based on the task allocation characteristics of the proposed algorithm, this study also adopts a state variable compression technique that significantly improves the cache hit rate of state variable operations. Additionally, several GPU hardware features, including predicated execution, are investigated to comprehensively optimize the implementation. Tested on an NVIDIA V100 GPU, the proposed algorithm achieves an average speedup of 2.71× over cuSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The modified element-wise operation strategy, combined with a series of other GPU hardware optimizations, yields an increase of nearly 115% in the effective memory-access bandwidth of the proposed algorithm.
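
To illustrate why a structured non-zero pattern can make the level-scheduling analysis phase unnecessary, the following is a minimal Python sketch and not the algorithm proposed in the paper. It assumes a hypothetical lower triangular system arising from a 2D grid in which each unknown (i, j) depends only on (i-1, j) and (i, j-1); the names `structured_sptrsv`, `diag`, `west`, and `south` are illustrative. Under that assumption the level of each unknown is simply i + j, so the levels follow from the indices without inspecting the matrix.

```python
def structured_sptrsv(nx, ny, diag, west, south, b):
    """Solve L x = b for an assumed structured lower triangular L.

    Row (i, j) of L holds a diagonal entry diag[i][j], a coupling
    west[i][j] to unknown (i-1, j), and a coupling south[i][j] to
    unknown (i, j-1) on an nx-by-ny grid (hypothetical layout).
    """
    x = [[0.0] * ny for _ in range(nx)]
    # Level k contains every grid point with i + j == k. The levels are
    # known from the indices alone, so no dependency analysis is needed.
    for k in range(nx + ny - 1):
        # Points inside one level do not depend on each other; on a GPU
        # each of them could be assigned to its own thread.
        for i in range(max(0, k - ny + 1), min(nx, k + 1)):
            j = k - i
            s = b[i][j]
            if i > 0:
                s -= west[i][j] * x[i - 1][j]
            if j > 0:
                s -= south[i][j] * x[i][j - 1]
            x[i][j] = s / diag[i][j]
    return x
```

The inner loop runs serially here; it only marks the work that is independent within a level. The early and late levels contain very few points, which is one way to see the thread idling that the abstract attributes to level scheduling.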

GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

Sparse LU Factorization for Parallel Circuit Simulation on GPU

Nonzero Pattern Analysis and Memory Access Optimization in GPU-based Sparse LU Factorization for Circuit Simulation

An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation

SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs

NUMA-aware parallel sparse LU factorization for SPICE-based circuit simulators on ARM multi-core processors

Parallel Circuit Simulation on Multi/Many-core Systems.

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations

NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation

Sparse matrix LU decomposition method based on GPU

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

A Fast Parallel Sparse Solver for SPICE-based Circuit Simulators.

Sparsity-Oriented Sparse Solver Design For Circuit Simulation

An EScheduler-Based Data Dependence Analysis and Task Scheduling for Parallel Circuit Simulation

Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation.

A New Hybrid GPU-CPU Sparse LDLT Factorization Algorithm with GPU and CPU Factorizing Concurrently

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

Batched sparse direct solver design and evaluation in SuperLU_DIST

Parallel Structured Sparse Triangular Solver for GPU Platform

Acceleration for Timing-Aware Gate-Level Logic Simulation with One-Pass GPU Parallelism