Batched sparse direct solver design and evaluation in SuperLU_DIST
Wajih Boukaram, Yuxi Hong, Yang Liu, Tianyi Shi, Xiaoye S Li
DOI: https://doi.org/10.1177/10943420241268200
2024-08-25
The International Journal of High Performance Computing Applications (Ahead of Print)
Abstract: Over the course of interactions with various application teams, the need for batched sparse linear algebra functions has emerged in order to make more efficient use of GPUs for many small, sparse linear algebra problems. In this paper, we present our recent work on a batched sparse direct solver for GPUs. The sparse LU factorization is computed level by level over the elimination tree, leveraging batched dense operations at each level and a new batched Scatter GPU kernel. The sparse triangular solve is computed by the level sets of the directed acyclic graph (DAG) of the triangular matrix. Batched operations overcome the large overhead associated with launching many small kernels. For medium-sized matrix batches with not-so-small bandwidth, using an NVIDIA A100 GPU, our new batched sparse direct solver is orders of magnitude faster than a batched banded solver and uses less than one-tenth of the memory.
computer science, theory & methods, interdisciplinary applications, hardware & architecture
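
The abstract's description of a triangular solve computed "by the level sets of the DAG" can be illustrated with a minimal CPU sketch in Python. This is not the SuperLU_DIST implementation: the function names `level_sets` and `solve_by_levels`, the list-of-dicts matrix layout, and the sample matrix are all assumptions made for illustration. Rows in the same level have no dependencies on each other, which is the property that would let a GPU handle a whole level in one batched launch.

```python
def level_sets(rows):
    """Group the rows of a lower-triangular matrix into dependency levels.

    rows[i] is a dict {j: L_ij} with j <= i, including the diagonal entry.
    (Hypothetical layout chosen for readability, not a SuperLU_DIST format.)
    """
    n = len(rows)
    level = [0] * n
    for i in range(n):
        # Row i depends on every earlier unknown j that it references.
        deps = [j for j in rows[i] if j < i]
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def solve_by_levels(rows, b):
    """Forward substitution L x = b, processed one level set at a time."""
    x = [0.0] * len(b)
    for rows_in_level in level_sets(rows):
        # Rows in one level only read x values from earlier levels, so this
        # inner loop is the part that could run as a single batched step.
        for i in rows_in_level:
            s = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - s) / rows[i][i]
    return x


# Illustrative 4x4 lower-triangular system (made-up values).
L = [
    {0: 2.0},
    {1: 1.0},
    {0: 1.0, 2: 4.0},
    {1: 3.0, 2: 1.0, 3: 2.0},
]
b = [2.0, 3.0, 9.0, 15.0]

levels = level_sets(L)
x = solve_by_levels(L, b)
```

In this sketch rows 0 and 1 fall into the first level because neither references another unknown, while rows 2 and 3 each land in later levels. The fewer and wider the levels, the more work each batched step carries, which is why launch overhead matters for many small problems.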