Abstract: Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. These new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs), surpassing our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of the batched functionality of Ginkgo.

What problem does this paper attempt to address?

The paper addresses how to port batched iterative solvers to Intel GPUs and optimize them using the SYCL programming model. Specifically, the authors ask how to extend these solvers to the Intel GPU architecture used by upcoming supercomputers such as Aurora. Their goal is to improve solver performance on Intel GPUs, especially for the small and medium-sized sparse problems that arise in fields such as plasma physics and combustion simulation.

### Background and Motivation of the Paper

- **Importance of Batched Iterative Solvers**: Batched iterative solvers play an important role in computational science, especially in plasma physics and combustion simulation. They efficiently solve a batch of small and medium-sized sparse problems.
- **Existing Challenges**: As more supercomputers adopt non-NVIDIA GPUs (such as Intel and AMD GPUs), existing batched solvers must be ported from the CUDA ecosystem to other programming models, such as SYCL.
- **Advantages of SYCL**: SYCL is a cross-platform abstraction layer that offers good performance and portability across different architectures (such as CPUs, GPUs, and FPGAs).

### Main Contributions

1. **Successful Porting**: The batched iterative solvers are ported to Intel GPUs using the SYCL programming model.
2. **Performance Tuning**: The batched solvers are tuned for different matrix sizes.
3. **Performance Evaluation**: The batched sparse iterative solvers are evaluated on Intel GPUs and compared with the latest NVIDIA H100 GPU, each using the vendor's native programming model (SYCL and CUDA, respectively).
4. **Practical Applications**: The evaluation uses matrices from the Pele reactive flow simulation application. These matrices come from the linear systems that Pele solves through SUNDIALS during ODE integration, and they are well suited to batched solution.
### Technical Details

- **Batch Matrix Formats**: Multiple batch matrix formats are supported, including BatchCsr, BatchEll, and BatchDense, to suit different matrix structures.
- **Batch Solver Kernels**: Several batched iterative solver kernels are implemented, such as CG, BiCGSTAB, and GMRES, together with the related BLAS operations.
- **Multi-level Dispatch Mechanism**: A multi-level dispatch mechanism lets the matrix format, solver, stopping criterion, and preconditioner be selected at runtime.
- **Minimize Kernel Launch Latency**: All solver operations are fused into a single kernel to reduce kernel-launch overhead.
- **Maximize Local Memory Usage**: Intermediate vectors and preconditioner matrices are placed in shared local memory (SLM) where they fit, which improves memory-access efficiency.
- **Optimization Based on Matrix Size**: The work-group size is chosen dynamically according to the size of the input matrix.

### Conclusion

The authors demonstrate that the batched iterative solvers implemented with SYCL on Intel GPUs can not only match but exceed the performance of the CUDA implementation on NVIDIA H100 GPUs, while also offering good performance portability and production readiness. These results provide a solid foundation for future applications in scientific computing.
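To make the batched storage and solver ideas from the technical details more concrete, the following is a minimal CPU-only sketch in plain Python, not the paper's SYCL code and not Ginkgo's API. It assumes that all systems in a batch share one sparsity pattern and differ only in their values (an assumption of this sketch), and it runs an unpreconditioned CG for each system with its own convergence check. The class `BatchCsr` here is a hypothetical container that merely borrows the format name, and `solve_one_cg` and `batch_cg` are illustrative helper names. On a GPU, the body of `solve_one_cg` is roughly what a single work-group would execute inside one fused kernel.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchCsr:
    """Hypothetical batched CSR container (illustration only).

    Assumption: every system shares row_ptrs/col_idxs; only values differ.
    """
    num_rows: int
    row_ptrs: List[int]
    col_idxs: List[int]
    values: List[List[float]]  # values[k] holds the nonzeros of system k

    def apply(self, k: int, x: List[float]) -> List[float]:
        """Sparse matrix-vector product y = A_k x for batch item k."""
        vals = self.values[k]
        y = [0.0] * self.num_rows
        for row in range(self.num_rows):
            for nz in range(self.row_ptrs[row], self.row_ptrs[row + 1]):
                y[row] += vals[nz] * x[self.col_idxs[nz]]
        return y


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve_one_cg(mat: BatchCsr, k: int, b: List[float],
                 tol: float = 1e-10, max_iters: int = 100
                 ) -> Tuple[List[float], int]:
    """Unpreconditioned CG for system k, starting from x = 0.

    The whole loop stays in one function, mirroring the single-kernel idea:
    no per-iteration launches, and each system stops on its own residual.
    """
    x = [0.0] * mat.num_rows
    r = list(b)          # residual b - A x with x = 0
    p = list(r)          # search direction
    rs_old = dot(r, r)
    iters = 0
    while iters < max_iters and rs_old ** 0.5 > tol:
        ap = mat.apply(k, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iters += 1
    return x, iters


def batch_cg(mat: BatchCsr, rhs: List[List[float]],
             tol: float = 1e-10, max_iters: int = 100):
    """Solve every system in the batch independently."""
    return [solve_one_cg(mat, k, rhs[k], tol, max_iters)
            for k in range(len(mat.values))]


# Two 3x3 symmetric positive definite tridiagonal systems, one shared pattern.
row_ptrs = [0, 2, 5, 7]
col_idxs = [0, 1, 0, 1, 2, 1, 2]
values = [
    [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0],
    [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0],
]
batch = BatchCsr(3, row_ptrs, col_idxs, values)
# Right-hand sides chosen so that the exact solution of each system is all ones.
rhs = [[1.0, 0.0, 1.0], [5.0, 6.0, 5.0]]
results = batch_cg(batch, rhs)
for k, (x, iters) in enumerate(results):
    print(f"system {k}: x = {x}, iterations = {iters}")
```

Sharing the sparsity pattern means the index arrays are stored once per batch rather than once per system, which is one plausible reason a batched format helps when many small systems have the same structure. Tracking convergence per system lets easy systems stop early instead of iterating until the hardest one finishes.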

Porting Batched Iterative Solvers onto Intel GPUs with SYCL
