Abstract: Modern accelerators use hierarchical parallel programming models that enable massive multithreading within a processing element (PE), with multiple PEs per device driven by traditional processes. Batching is a technique for exposing PE-level parallelism in algorithms that have traditionally run on MPI processes or multiple threads within a single process. Opportunities for batching arise in, for example, kinetic discretizations of magnetized plasmas where collisions are advanced in velocity space at each spatial point independently. This paper builds on previous work on a high-performance, fully nonlinear, Landau collision operator by batching the linear solver, as well as batching the spatial point problems and adding new support for multiple grids for multiscale, multi-species problems. An anisotropic relaxation verification test that agrees well with previously published results and analytical models is presented. Performance results from NVIDIA A100 and AMD MI250X nodes are presented with hardware utilization analysis for each architecture. The entire implicit Landau operator time advance is implemented in Kokkos for performance portability, runs entirely on the device, and is available in the PETSc numerical library.

What problem does this paper attempt to address?

The main problem this paper addresses is improving the performance and portability of the Landau collision operator on modern accelerator hardware. Specifically, it optimizes the linear solver through batching and adds support for multiple grids in multiscale, multi-species problems. The paper also measures performance on two hardware architectures (NVIDIA A100 and AMD MI250X) and provides a hardware utilization analysis for each.

### Main research contents:

1. **Application of batching techniques**: The paper uses batching to expose parallelism at the processing element (PE) level, improving the execution efficiency of the algorithm on the accelerator. Both the spatial point problems and the linear solver are batched.
2. **Multiple-grid support**: The paper introduces support for multiple grids to handle multiscale, multi-species problems. This matters for magnetized plasma simulations because species can differ widely in thermal velocity and therefore need different grid resolutions.
3. **Performance analysis**: The paper analyzes performance on NVIDIA A100 and AMD MI250X nodes, including hardware utilization. The analysis helps identify performance bottlenecks and optimization directions on each architecture.
4. **Verification test**: The paper verifies the correctness of the Landau collision operator with an anisotropic relaxation test. This test is based on the previous work of Hager et al.: a deuterium plasma is initialized with different parallel and perpendicular temperatures, and its evolution toward equilibrium is observed.
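
To make the multiple-grid motivation concrete, the sketch below computes thermal speeds for a few species at the same temperature and groups species with similar thermal speeds onto a shared grid. It is only an illustration, not the paper's or PETSc's algorithm: the grouping rule, the `max_ratio` threshold, the 1 keV temperature, and the function names `thermal_velocity` and `assign_grids` are all assumptions introduced here. The thermal speed uses the convention sqrt(kT/m); other conventions differ by a constant factor.

```python
import math

# Physical constants (SI units).
ELEMENTARY_CHARGE = 1.602176634e-19  # J per eV
ELECTRON_MASS = 9.1093837e-31        # kg
DEUTERON_MASS = 3.3435838e-27        # kg
TRITON_MASS = 5.0073567e-27          # kg


def thermal_velocity(temperature_ev, mass_kg):
    """Thermal speed sqrt(k T / m) in m/s for a temperature given in eV."""
    return math.sqrt(temperature_ev * ELEMENTARY_CHARGE / mass_kg)


def assign_grids(species, max_ratio=4.0):
    """Group species whose thermal speeds lie within max_ratio of the slowest
    species already on the grid.

    `species` maps a name to (temperature in eV, mass in kg). The rule and the
    threshold are hypothetical, chosen only for illustration.
    """
    # Visit species from slowest to fastest thermal speed.
    ordered = sorted(species, key=lambda name: thermal_velocity(*species[name]))
    grids = []
    for name in ordered:
        v = thermal_velocity(*species[name])
        if grids and v <= max_ratio * grids[-1]["v_min"]:
            grids[-1]["species"].append(name)
            grids[-1]["v_max"] = v
        else:
            grids.append({"species": [name], "v_min": v, "v_max": v})
    return grids


# Hypothetical plasma: three species, all at 1 keV.
SPECIES = {
    "electron": (1000.0, ELECTRON_MASS),
    "deuterium": (1000.0, DEUTERON_MASS),
    "tritium": (1000.0, TRITON_MASS),
}

grids = assign_grids(SPECIES)
for index, grid in enumerate(grids):
    print(index, grid["species"], f"{grid['v_min']:.3e}", f"{grid['v_max']:.3e}")
```

At equal temperature the electron thermal speed is roughly 60 times that of deuterium (the square root of the mass ratio), so with this threshold the two hydrogen isotopes share one grid and the electrons get their own.
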
### Key techniques and methods:

- **Batched linear solver**: Batching the linear solver reduces the need for global synchronization and improves parallel efficiency.
- **Multiple-grid adaptation**: Different species can use different grids, which gives more flexibility in grid resolution.
- **Performance portability**: The Kokkos programming model is used for performance portability, so that the same code can run on different hardware platforms.
- **Numerical library integration**: The entire implicit Landau operator time advance runs on the device and is available in the PETSc numerical library.

### Conclusion:

By introducing batching and multiple-grid support, the paper improves the performance and portability of the Landau collision operator on modern accelerator hardware. The verification test agrees well with previously published results and analytical models, and performance results with hardware utilization analysis are reported for NVIDIA A100 and AMD MI250X nodes. The approach is relevant to magnetized plasma simulations and other high-dimensional applications.
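
The batching idea summarized above can be sketched in a few lines: many small, independent linear systems are solved by the same routine, and each system checks its own convergence, so no step requires synchronization across systems. This is a serial, pure-Python illustration under stated assumptions, not the paper's implementation. Jacobi iteration is swapped in only because it is short (it is not claimed to be the solver used in the paper), and the names `jacobi_solve` and `batched_solve` and the example systems are hypothetical.

```python
def jacobi_solve(matrix, rhs, tol=1e-10, max_iters=500):
    """Solve one small dense system with Jacobi iteration.

    Returns (solution, iterations used). Convergence is guaranteed for
    strictly diagonally dominant matrices, which the examples below are.
    """
    n = len(rhs)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        x_new = [
            (rhs[i] - sum(matrix[i][j] * x[j] for j in range(n) if j != i))
            / matrix[i][i]
            for i in range(n)
        ]
        change = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if change < tol:
            return x, iteration
    return x, max_iters


def batched_solve(matrices, rhs_vectors):
    """Solve every system in the batch independently.

    Each solve touches only its own matrix and right-hand side, so on an
    accelerator each one could be assigned to a separate processing element.
    Here the batch is simply a serial loop.
    """
    return [jacobi_solve(a, b) for a, b in zip(matrices, rhs_vectors)]


# Two hypothetical 2x2 systems standing in for per-spatial-point problems.
MATRICES = [
    [[4.0, 1.0], [1.0, 3.0]],
    [[10.0, 2.0], [3.0, 5.0]],
]
RHS = [
    [1.0, 2.0],
    [12.0, 8.0],
]

results = batched_solve(MATRICES, RHS)
for solution, iterations in results:
    print(solution, iterations)
```

Because each system stops as soon as its own residual change is small, systems that converge quickly do not wait on slower ones, which is the property that makes batched solvers attractive for per-point problems.
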

A performance portable, fully implicit Landau collision operator with batched linear solvers

Implementation of the moving particle semi-implicit method for free-surface flows on GPU clusters.

Parallelized Implementation of the Finite Particle Method for Explicit Dynamics in GPU

Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

Performance Portable Solid Mechanics via Matrix-Free $p$-Multigrid

Towards a platform-portable linear algebra backend for OpenFOAM

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results

Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Parthenon—a performance portable block-structured adaptive mesh refinement framework

Studying performance portability of LAMMPS across diverse GPU‐based platforms

Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels

A Study of Performance Portability in Plasma Physics Simulations

Portable GPU implementation of the WP-CCC ion-atom collisions code

Evaluating performance portability of five shared-memory programming models using a high-order unstructured CFD solver

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

A Massively Parallel Performance Portable Free-space Spectral Poisson Solver

Efficient Parallel Implementation of the Lattice Boltzmann Method on Large Clusters of Graphic Processing Units

A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters

High Performance Optimizations For Nuclear Physics Code Mfdn On Knl