Abstract: This paper explores the performance optimization of out-of-core (OOC) Cholesky factorization on shared-memory systems equipped with multiple GPUs. We employ fine-grained computational tasks to expose concurrency while creating opportunities to overlap data movement asynchronously with computation, especially for matrices that do not fit in GPU memory. We leverage the directed acyclic graph of the task-based Cholesky factorization and map it onto a static scheduler that promotes data reuse while supporting strategies for reducing data movement to and from the CPU host when GPU memory is exhausted. The CPU-GPU interconnect may become the main performance bottleneck as the gap between the GPU execution rate and traditional PCIe bandwidth continues to widen. While the surface-to-volume effect of compute-bound kernels partially mitigates the overhead of data motion, deploying mixed-precision (MxP) computations exacerbates the throughput discrepancy. Using static task scheduling, we evaluate the performance capabilities of the new ultra-fast NVIDIA chip interconnect technology, codenamed NVLink-C2C, which constitutes the backbone of the NVIDIA Grace Hopper Superchip (GH200), with a new four-precision (FP64/FP32/FP16/FP8) left-looking Cholesky factorization. We report the performance results of a benchmarking campaign on various NVIDIA GPU generations and interconnects. We report a 20% performance advantage over cuSOLVER on a single GH200 with FP64 while hiding the cost of OOC task-based Cholesky factorization, and we scale almost linearly on four GH200 superchips. With MxP enabled, our statically scheduled four-precision tile-based Cholesky factorization achieves a 3x speedup over its FP64-only counterpart, delivering application-worthy FP64 accuracy when modeling a large-scale geospatial statistical application.
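
The task structure behind a left-looking tile Cholesky factorization can be illustrated with a small sketch. The Python code below is not the paper's implementation: it only enumerates the tile kernels (POTRF, TRSM, SYRK, GEMM) in the order a left-looking sweep would issue them, and the function names `left_looking_tasks` and `assign_round_robin`, as well as the row-wise round-robin GPU mapping, are assumptions made for illustration. The abstract does not specify the actual mapping or data-reuse policy of the static scheduler.

```python
from collections import Counter


def left_looking_tasks(nt):
    """Enumerate the tile tasks of a left-looking Cholesky on an nt x nt tile grid.

    Each task is (kernel, output_tile, input_tiles); tiles are (row, col)
    indices in the lower triangle of the tiled matrix.
    """
    tasks = []
    for k in range(nt):
        # Apply all pending updates to the diagonal tile of column k,
        # using the already-factored tiles to its left in row k.
        for n in range(k):
            tasks.append(("SYRK", (k, k), [(k, n)]))
        # Factor the updated diagonal tile.
        tasks.append(("POTRF", (k, k), []))
        for m in range(k + 1, nt):
            # Apply all pending updates to the off-diagonal tile (m, k).
            for n in range(k):
                tasks.append(("GEMM", (m, k), [(m, n), (k, n)]))
            # Triangular solve against the freshly factored diagonal tile.
            tasks.append(("TRSM", (m, k), [(k, k)]))
    return tasks


def assign_round_robin(tasks, num_gpus):
    """Hypothetical static mapping: the GPU owning the output tile's row runs the task."""
    return [(out[0] % num_gpus, kernel, out) for kernel, out, _ in tasks]


tasks = left_looking_tasks(4)
counts = Counter(kernel for kernel, _, _ in tasks)
schedule = assign_round_robin(tasks, 2)
```

For a 4 x 4 tile grid this yields 20 tasks: 4 POTRF, 6 TRSM, 6 SYRK, and 4 GEMM. Every task at step k writes only into tile column k, which is the property that distinguishes the left-looking variant: a column receives all its updates immediately before it is factored, so it needs to be resident in GPU memory for writing only once.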

Sparse Cholesky Factorization on FPGA Using Parameterized Model

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations

Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture

GPU Accelerated Sparse Cholesky Factorization

A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization

GPU Accelerated Parallel Cholesky Factorization

An Analytical Model for Domain-Specific Accelerator Deploying Sparse LU Factorization

FPGA implementation for solving linear least square problem

GPU-based multifrontal optimizing method in sparse Cholesky factorization

Implementing LU and Cholesky factorizations on artificial intelligence accelerators

Accelerating Mixed-Precision Out-of-Core Cholesky Factorization with Static Task Scheduling

GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

A Novel Fully Hardware-Implemented SVD Solver Based on Ultra-Parallel BCV Jacobi Algorithm

Parallel Cholesky Factorization for Banded Matrices using OpenMP Tasks

Sparse LU Factorization for Parallel Circuit Simulation on GPU

FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-art to Future Opportunities

FPGA Accelerator for CNN: an Exploration of the Kernel Structured Sparsity and Hybrid Arithmetic Computation

Optimizing the Performance of the Sparse Matrix-Vector Multiplication Kernel in FPGA Guided by the Roofline Model

Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools

An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs