Heterogeneous acceleration algorithms for shallow cumulus convection scheme over GPU clusters

Fei Li, Yuzhu Wang, Jinrong Jiang, He Zhang, Xiaocong Wang, Xuebin Chi
DOI: https://doi.org/10.1016/j.future.2023.04.021
IF: 7.307
2023-04-25
Future Generation Computer Systems
Abstract: The physical process of atmospheric cumulus convection plays a crucial role in climate modeling, and its complex computational process severely restricts the development of high-resolution climate models. Accelerating the cumulus convective process calculation in climate models is a significant challenge. Traditional CPU-accelerated computing is increasingly unable to meet the growing demand for computing resources from high-resolution climate models. Therefore, developing an efficient cumulus convection scheme is quite valuable and necessary. In response to this demand, this paper selects the University of Washington shallow cumulus (UWshcu) model as the research object and proposes parallel algorithms for it that are suitable for large-scale, heterogeneous, high-performance computing systems: (1) a single-GPU acceleration algorithm based on CUDA C, namely GPU-UWshcu; (2) a multi-GPU acceleration algorithm for NVIDIA GPUs based on the MPI+CUDA hybrid programming model, namely CGPUs-UWshcu; (3) a multi-GPU acceleration algorithm for AMD GPUs based on MPI+HIP, namely HGPUs-UWshcu; (4) a multiple CPUs+GPUs acceleration algorithm based on MPI+OpenMP+HIP, namely MOH-UWshcu. Experimental results show that these algorithms are efficient and have good scalability. GPU-UWshcu achieves a speedup of 74.39× on a single Tesla V100 GPU compared with the serial algorithm running on an Intel Xeon E5-2680 v2 CPU core, and CGPUs-UWshcu achieves a 151.22× speedup on 16 Tesla V100 GPUs compared to a single Intel Xeon E5-2630 v4 CPU (10 cores). On the ORISE supercomputer, HGPUs-UWshcu uses 1024 AMD GPUs to achieve a 664.65× speedup compared to a single CPU (32 cores), with a parallel efficiency of 68.91% compared to using 32 GPUs. Compared to using the same number of CPU cores, MOH-UWshcu uses 8192 CPU cores+1024 GPUs to achieve a speedup of 4.98×, with 55.22 TFLOPS in double precision.
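The scaling figures in the abstract can be related to one another with a little arithmetic. The sketch below assumes the common definition of relative parallel efficiency (actual speedup gain divided by the ideal linear gain when going from the base GPU count to the scaled one); the abstract does not state which definition the authors use, so the function names and the implied 32-GPU speedup are illustrative assumptions, not results from the paper.

```python
def relative_parallel_efficiency(speedup_base, units_base, speedup_scaled, units_scaled):
    """Efficiency of a scaled run relative to a base run (1.0 means ideal linear scaling).

    Assumed definition: (speedup_scaled / speedup_base) / (units_scaled / units_base).
    """
    ideal_gain = units_scaled / units_base
    actual_gain = speedup_scaled / speedup_base
    return actual_gain / ideal_gain


def implied_base_speedup(speedup_scaled, units_base, units_scaled, efficiency):
    """Invert the definition above to get the base-run speedup implied by a reported efficiency."""
    return speedup_scaled * units_base / (units_scaled * efficiency)


# Figures quoted in the abstract: 664.65x on 1024 GPUs, 68.91% efficiency relative to 32 GPUs.
# The 32-GPU speedup is NOT given in the abstract; it is derived here under the assumed definition.
base = implied_base_speedup(664.65, 32, 1024, 0.6891)
eff = relative_parallel_efficiency(base, 32, 664.65, 1024)
print(f"implied 32-GPU speedup: {base:.2f}x, round-trip efficiency: {eff:.4f}")
```

Under this assumed definition, a 32-fold increase in GPU count at 68.91% efficiency corresponds to roughly a 22-fold gain in speedup over the 32-GPU run.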
computer science, theory & methods