A performance portable, fully implicit Landau collision operator with batched linear solvers

Mark F. Adams, Peng Wang, Jacob Merson, Kevin Huck, Matthew G. Knepley
2024-07-09
Abstract: Modern accelerators use hierarchical parallel programming models that enable massive multithreading within a processing element (PE), with multiple PEs per device driven by traditional processes. Batching is a technique for exposing PE-level parallelism in algorithms that have traditionally run on MPI processes or multiple threads within a single process. Opportunities for batching arise in, for example, kinetic discretizations of magnetized plasmas where collisions are advanced in velocity space at each spatial point independently. This paper builds on previous work on a high-performance, fully nonlinear, Landau collision operator by batching the linear solver, as well as batching the spatial point problems and adding new support for multiple grids for multiscale, multi-species problems. An anisotropic relaxation verification test that agrees well with previous published results and analytical models is presented. The performance results from NVIDIA A100 and AMD MI250X nodes are presented with hardware utilization analysis for each architecture. The entire implicit Landau operator time advance is implemented in Kokkos for performance portability, running entirely on the device and is available in the PETSc numerical library.
Plasma Physics
What problem does this paper attempt to address?
This paper aims to improve the performance and portability of the Landau collision operator on modern accelerator hardware. Specifically, it batches the linear solver and the spatial point problems, and it adds support for multiple grids for multiscale, multi-species problems. It also reports performance on two hardware architectures (NVIDIA A100 and AMD MI250X) with a hardware utilization analysis for each.

### Main research contents:

1. **Application of batching techniques**: The paper uses batching to expose parallelism at the processing element (PE) level, improving the efficiency of the algorithm on accelerators. Both the spatial point problems and the linear solver are batched.
2. **Multiple-grid support**: The paper adds support for multiple grids to handle multiscale, multi-species problems. This matters for magnetized plasma simulations because species differ widely in thermal velocity and therefore need different grid resolutions.
3. **Performance analysis**: The paper analyzes performance on NVIDIA A100 and AMD MI250X nodes, including hardware utilization on each architecture. The analysis helps identify performance bottlenecks and directions for optimization on each platform.
4. **Verification test**: The paper verifies the Landau collision operator with an anisotropic relaxation test based on the previous work of Hager et al. A deuterium plasma is initialized with different parallel and perpendicular temperatures, and its relaxation toward equilibrium is compared with previously published results and analytical models.

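To make the thermal velocity argument concrete, the short Python sketch below compares electrons and deuterons at the same temperature. It is an illustration only, not code from the paper: the convention v_th = sqrt(2kT/m), the temperature of 1000 eV, and the helper `velocity_domain_radius` with its factor of 5 thermal velocities are all assumptions made here.

```python
import math

# Physical constants in SI units (CODATA values).
ELECTRON_MASS_KG = 9.1093837015e-31
DEUTERON_MASS_KG = 3.3435837724e-27
JOULES_PER_EV = 1.602176634e-19


def thermal_velocity(temperature_ev: float, mass_kg: float) -> float:
    """Thermal speed sqrt(2 k T / m) in m/s for a temperature given in eV.

    The factor of 2 is a convention assumed here; others use sqrt(k T / m).
    """
    return math.sqrt(2.0 * temperature_ev * JOULES_PER_EV / mass_kg)


def velocity_domain_radius(temperature_ev: float, mass_kg: float,
                           n_thermal: float = 5.0) -> float:
    """Hypothetical helper: velocity-grid extent as a multiple of v_th."""
    return n_thermal * thermal_velocity(temperature_ev, mass_kg)


temperature_ev = 1000.0  # illustrative value, not taken from the paper
v_electron = thermal_velocity(temperature_ev, ELECTRON_MASS_KG)
v_deuteron = thermal_velocity(temperature_ev, DEUTERON_MASS_KG)
ratio = v_electron / v_deuteron  # equals sqrt(m_D / m_e) at equal temperature

print(f"electron v_th : {v_electron:.3e} m/s")
print(f"deuteron v_th : {v_deuteron:.3e} m/s")
print(f"ratio         : {ratio:.1f}")
print(f"electron grid radius: {velocity_domain_radius(temperature_ev, ELECTRON_MASS_KG):.3e} m/s")
print(f"deuteron grid radius: {velocity_domain_radius(temperature_ev, DEUTERON_MASS_KG):.3e} m/s")
```

At equal temperature the ratio depends only on the mass ratio and comes out at roughly 60. A single velocity grid wide enough for electrons would therefore be very coarse relative to the deuteron distribution, which is the motivation for giving species their own grids.
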
### Key techniques and methods:

- **Batched linear solver**: Batching the linear solver reduces the need for global synchronization and improves parallel efficiency.
- **Multiple grids**: Different species can use different grids, which gives flexibility in grid resolution.
- **Performance portability**: The Kokkos programming model provides performance portability, so the same code can run efficiently on different hardware platforms.
- **Numerical library integration**: The entire implicit Landau operator time advance runs on the device and is available in the PETSc numerical library.

### Conclusion:

By batching the linear solver and the spatial point problems and by adding support for multiple grids, the paper improves the performance and portability of the Landau collision operator on modern accelerator hardware. The anisotropic relaxation verification test and the performance analysis on NVIDIA A100 and AMD MI250X nodes support these results. The work is relevant to magnetized plasma simulations and other high-dimensional applications.
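
The batching idea can be sketched in a few lines of Python. This is a serial illustration under stated assumptions, not the paper's implementation: the paper's solvers run in Kokkos on the device, whereas here a direct dense solve (Gaussian elimination with partial pivoting) is swapped in for simplicity, and the names `solve_dense` and `solve_batch` are hypothetical. What the sketch shows is the structure that makes batching possible: many small systems, one per spatial point, none of which reads another's data.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve_dense(a: Matrix, b: Vector) -> Vector:
    """Solve one small system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def solve_batch(matrices: List[Matrix], rhs: List[Vector]) -> List[Vector]:
    """Solve every system in the batch independently of the others."""
    # On an accelerator each system could be assigned to its own processing
    # element; in this sketch the batch is a plain serial loop.
    return [solve_dense(a, b) for a, b in zip(matrices, rhs)]


# Two independent 2x2 systems standing in for two spatial points.
batch_a = [
    [[2.0, 1.0], [1.0, 3.0]],
    [[4.0, 0.0], [0.0, 2.0]],
]
batch_b = [
    [3.0, 5.0],
    [8.0, 2.0],
]
solutions = solve_batch(batch_a, batch_b)
print(solutions)
```

Because no step in `solve_dense` depends on any other system in the batch, no synchronization across systems is needed until the whole batch is done.
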