Abstract: The International Journal of High Performance Computing Applications, Ahead of Print. We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is publicly available on GitHub under a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs, and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable-sized batched dense linear algebra operations such as GEMM, TRSM, and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner run on NVIDIA, AMD, and Intel GPUs. Interfaces are available from PETSc, Trilinos, and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF.
For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster than that of the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA's cuDSS solver.
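
To make the memory argument behind block low rank compression concrete, the following is a minimal sketch of a storage count for a single dense block, not STRUMPACK's actual data layout. It assumes a simplified model in which the block is split into equally sized tiles, diagonal tiles stay dense, and every off-diagonal tile is stored as a rank-r product of two thin factors. The function name `blr_storage` and the parameters `n`, `b`, and `r` are hypothetical and chosen only for illustration; in practice the rank varies from tile to tile.

```python
def blr_storage(n, b, r):
    """Count stored entries for an n x n matrix split into b x b tiles.

    Simplified, illustrative model (an assumption, not STRUMPACK's layout):
    b divides n, diagonal tiles stay dense, and every off-diagonal tile is
    kept as a rank-r product U @ V.T with U and V of shape b x r.
    Returns (dense_entries, blr_entries).
    """
    if n % b != 0:
        raise ValueError("b must divide n in this simplified model")
    if not 0 <= r <= b:
        raise ValueError("rank r must lie in [0, b]")
    tiles = n // b                                   # tiles per row/column
    dense_entries = n * n                            # uncompressed storage
    diagonal = tiles * b * b                         # dense diagonal tiles
    off_diagonal = tiles * (tiles - 1) * 2 * b * r   # U and V per tile
    return dense_entries, diagonal + off_diagonal


# Hypothetical sizes: a 4096 x 4096 block, 256 x 256 tiles, rank 16.
dense, blr = blr_storage(n=4096, b=256, r=16)
print(f"dense: {dense} entries, BLR: {blr} entries, ratio: {dense / blr:.2f}")
```

In this model the compressed form only saves memory while r stays below b/2; at r = b/2 the two factors of an off-diagonal tile take exactly as much space as the dense tile.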

SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs

Sparse LU Factorization for Parallel Circuit Simulation on GPU

GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

Nonzero Pattern Analysis and Memory Access Optimization in GPU-based Sparse LU Factorization for Circuit Simulation

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation

An Adaptive LU Factorization Algorithm For Parallel Circuit Simulation

FPGA Accelerated Parallel Sparse Matrix Factorization For Circuit Simulations

A New Hybrid GPU-CPU Sparse LDLT Factorization Algorithm with GPU and CPU Factorizing Concurrently

Sparse matrix LU decomposition method based on GPU

NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation

Batched sparse direct solver design and evaluation in SuperLU_DIST

NUMA-aware parallel sparse LU factorization for SPICE-based circuit simulators on ARM multi-core processors

An EScheduler-Based Data Dependence Analysis and Task Scheduling for Parallel Circuit Simulation

A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint

Parallel Sparse LU Decomposition Using FPGA with an Efficient Cache Architecture

GPU Accelerated Sparse Cholesky Factorization

Parallel Circuit Simulation on Multi/Many-core Systems