Abstract: Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures.
In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. These new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs) which surpass our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of the batched functionality of Ginkgo.
What problem does this paper attempt to address?
The paper addresses how to port batched iterative solvers to Intel GPUs and optimize them using the SYCL programming model. Specifically, the authors want to extend these solvers to the Intel GPU architecture in time for upcoming supercomputers equipped with Intel GPUs, such as Aurora. The goal is high performance on Intel GPUs for batches of small and medium-sized sparse problems, such as those arising in plasma physics and combustion simulation.
### Background and Motivation of the Paper
- **Importance of Batched Iterative Solvers**: Batched iterative solvers play an important role in computational science, especially in plasma physics and combustion simulation. They solve many independent small and medium-sized sparse problems efficiently as a single batch.
- **Existing Challenges**: As more supercomputers adopt non-NVIDIA GPUs (such as Intel and AMD GPUs), existing batched solvers must be ported from the CUDA ecosystem to other programming models, such as SYCL.
- **Advantages of SYCL**: SYCL is a cross-platform abstraction layer that offers good performance and portability across different architectures (such as CPUs, GPUs, and FPGAs).
### Main Contributions
1. **Successful Porting**: The batched iterative solvers are ported to Intel GPUs using the SYCL programming model.
2. **Performance Tuning**: The batched solvers are tuned for different matrix sizes.
3. **Performance Evaluation**: The batched sparse iterative solvers are evaluated on Intel GPUs and compared with the NVIDIA H100 GPU, each using the programming model native to its vendor (SYCL on Intel, CUDA on NVIDIA).
4. **Practical Applications**: The test matrices come from the Pele reactive flow simulation application. Pele uses SUNDIALS to integrate ODEs, and the linear systems that arise in that process are well suited to batched solution.
### Technical Details
- **Batch Matrix Formats**: Multiple batch matrix formats are supported, including BatchCsr, BatchEll, and BatchDense, to suit different types of matrices.
- **Batch Solver Kernels**: Several batched iterative solver kernels are implemented, such as CG, BiCGSTAB, and GMRES, together with the BLAS operations they rely on.
- **Multi-level Scheduling Mechanism**: A multi-level scheduling mechanism lets the matrix format, solver, stopping criterion, and preconditioner be selected at run time.
- **Minimize Kernel Launch Latency**: All solver operations are fused into a single kernel to reduce kernel-launch overhead.
- **Maximize Local Memory Usage**: Memory-access efficiency is improved by placing intermediate vectors and preconditioner matrices in shared local memory (SLM) where they fit.
- **Optimization Based on Matrix Size**: An appropriate work-group size is selected dynamically according to the size of the input matrix.
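The batch matrix format and single-kernel ideas listed above can be illustrated with a small sketch. The code below is plain Python, not SYCL and not Ginkgo's API: the `BatchCsr` class, `spmv`, `solve_one_cg`, and `batch_cg` are hypothetical names chosen for illustration, and the sketch assumes that all matrices in a batch share one sparsity pattern and differ only in their values. Each call to `solve_one_cg` runs an entire CG solve for one system, which is roughly the work a single work-group would perform inside one fused kernel.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchCsr:
    """Illustrative batch of CSR matrices (assumed: one shared sparsity pattern)."""
    row_ptrs: List[int]        # shared by every batch entry
    col_idxs: List[int]        # shared by every batch entry
    values: List[List[float]]  # one value array per batch entry

    @property
    def num_rows(self) -> int:
        return len(self.row_ptrs) - 1

    @property
    def num_batch_items(self) -> int:
        return len(self.values)


def spmv(mat: BatchCsr, k: int, x: List[float]) -> List[float]:
    """Compute y = A_k x for batch entry k using the shared pattern."""
    vals = mat.values[k]
    return [
        sum(vals[j] * x[mat.col_idxs[j]]
            for j in range(mat.row_ptrs[i], mat.row_ptrs[i + 1]))
        for i in range(mat.num_rows)
    ]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve_one_cg(mat: BatchCsr, k: int, b: List[float],
                 tol: float = 1e-10, max_iters: int = 100) -> Tuple[List[float], int]:
    """Unpreconditioned CG for system k (A_k must be symmetric positive definite).

    The whole iteration lives in one function, mimicking a fused kernel:
    no per-operation launches, and each system stops on its own criterion.
    """
    x = [0.0] * mat.num_rows
    r = list(b)          # residual for the zero initial guess
    p = list(r)          # search direction
    rr = dot(r, r)
    iters = 0
    while rr ** 0.5 > tol and iters < max_iters:
        ap = spmv(mat, k, p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters


def batch_cg(mat: BatchCsr, rhs: List[List[float]]) -> List[Tuple[List[float], int]]:
    """Solve every system independently; on a GPU this loop is the parallel dimension."""
    return [solve_one_cg(mat, k, rhs[k]) for k in range(mat.num_batch_items)]


# Two 3x3 tridiagonal systems with the same pattern but different values.
batch = BatchCsr(
    row_ptrs=[0, 2, 5, 7],
    col_idxs=[0, 1, 0, 1, 2, 1, 2],
    values=[[4, -1, -1, 4, -1, -1, 4],
            [2, -1, -1, 2, -1, -1, 2]],
)
rhs = [[2.0, 4.0, 10.0], [0.0, 0.0, 4.0]]  # both built from x = [1, 2, 3]
for k, (x, iters) in enumerate(batch_cg(batch, rhs)):
    print(f"system {k}: x = {[round(v, 6) for v in x]}, iterations = {iters}")
```

Storing the pattern once keeps the index arrays small, which in a GPU setting would leave more room in fast local memory for the per-system vectors `x`, `r`, and `p`. Because each system tracks its own residual, systems that converge early stop early instead of waiting for the slowest one.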
### Conclusion
The authors demonstrate that the batched iterative solvers implemented in SYCL on Intel GPUs outperform their earlier CUDA implementation on NVIDIA H100 GPUs, by an average of 2.4x on the PeleLM application inputs. The solvers are available through the Ginkgo library, are ready for production use, and extend the performance portability of Ginkgo's batched functionality.