Abstract: The International Journal of High Performance Computing Applications, Ahead of Print. We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is publicly available on GitHub under a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization that uses compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries: cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs, and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable-sized batched dense linear algebra operations such as GEMM, TRSM, and LU with partial pivoting. KBLAS provides efficient (batched) low-rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner run on NVIDIA, AMD, and Intel GPUs. Interfaces are available from PETSc, Trilinos, and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF.
For a high-frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster than that of the CPU-only exact solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA's cuDSS solver.
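
To make the idea of low-rank compression by adaptive randomized sampling concrete, the following is a minimal sketch in plain Python. It is not the KBLAS or STRUMPACK implementation: the function names (`compress_block`, `extend_basis`, `matmul`, `fro_norm`), the tolerance, the sampling step, and the test kernel are all assumptions made for illustration. The sketch multiplies a dense block by a few random Gaussian vectors, orthonormalizes the resulting samples, and adds more samples until the block is approximated as `U @ V` within a relative tolerance. For clarity it checks the error by forming the residual explicitly, which a production code would likely replace with a cheaper sampled estimate.

```python
import math
import random


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def fro_norm(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in A for x in row))


def extend_basis(basis, vectors, drop_tol=1e-10):
    """Orthonormalize `vectors` against `basis` (Gram-Schmidt, two passes).

    Vectors that are numerically inside the current span are dropped.
    """
    for v in vectors:
        start = math.sqrt(sum(x * x for x in v))
        w = list(v)
        for _ in range(2):  # second pass restores orthogonality lost to rounding
            for q in basis:
                dot = sum(a * b for a, b in zip(q, w))
                w = [a - dot * b for a, b in zip(w, q)]
        nrm = math.sqrt(sum(x * x for x in w))
        if start > 0.0 and nrm > drop_tol * start:
            basis.append([x / nrm for x in w])
    return basis


def compress_block(A, rel_tol=1e-6, step=4, seed=0):
    """Return (U, V, err) with A ~= U @ V, adding `step` random samples per round.

    Hypothetical helper for illustration only; all parameters are assumptions.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    norm_A = fro_norm(A)
    basis = []  # orthonormal vectors of length m, i.e. the rows of Q^T
    drawn = 0
    while True:
        # Sample the range of A with a random Gaussian test matrix.
        omega = [[rng.gauss(0.0, 1.0) for _ in range(step)] for _ in range(n)]
        Y = matmul(A, omega)
        extend_basis(basis, [list(col) for col in zip(*Y)])
        drawn += step
        V = matmul(basis, A)                  # k x n, equals Q^T A
        U = [list(row) for row in zip(*basis)]  # m x k, equals Q
        approx = matmul(U, V)
        err = fro_norm([[a - b for a, b in zip(ra, rb)]
                        for ra, rb in zip(A, approx)])
        # Stop once the tolerance is met or no more useful samples remain.
        if err <= rel_tol * norm_A or drawn >= min(m, n):
            return U, V, err


if __name__ == "__main__":
    # Interaction block between two well-separated point clusters (assumed kernel).
    pts = [i / 16 for i in range(16)]
    A = [[1.0 / (4.0 + y - x) for y in pts] for x in pts]
    U, V, err = compress_block(A)
    print("rank:", len(V), "relative error:", err / fro_norm(A))
```

In a BLR format, a compression of this kind would be applied to the off-diagonal blocks of each dense frontal matrix, so that a block of size m × n is stored as two factors of size m × k and k × n with k much smaller than m and n.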

Accelerating Pqmrcgstab Algorithm On Gpu

GPU Based Two-Level CMFD Accelerating Two-Dimensional MOC Neutron Transport Calculation

405 Acceleration of MARS by Using GPU

CUDA-based PCG algorithm optimization for a large sparse matrix

High Performance Computing Via a GPU

Study on Acceleration of Three-Dimensional Method of Characteristics by GPU

Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units

A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

Generalized Gpu Acceleration For Applications Employing Finite-Volume Methods

Optimizing sparse matrix-vector multiplication based on gpu

Using Graphics Processing Units to Accelerate Perturbation Monte Carlo Simulation in a Turbid Medium

Towards Accelerating Irregular EDA Applications with GPUs

Accelerating a three-dimensional MOC calculation using GPU with CUDA and two-level GCMFD method

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

CuQ-RTM: A CUDA-based Code Package for Stable and Efficient Q-compensated Reverse Time Migration

GPU optimization of material point methods

GPU-HADVPPM V1.0: a high-efficiency parallel GPU design of the piecewise parabolic method (PPM) for horizontal advection in an air quality model (CAMx V6.10)

GPU-Acceleration of Parallel Unconditionally Stable Group Explicit Finite Difference Method

Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm

Quantum Chemistry for Solvated Molecules on Graphical Processing Units (GPUs) using Polarizable Continuum Models

Accelerating Pcg Power/Ground Network Solver On Gpgpu