Abstract: Multiphase compressible flows are often characterized by a broad range of space and time scales, entailing large grids and small time steps. Simulations of these flows on CPU-based clusters can thus take several wall-clock days. Offloading the compute kernels to GPUs appears attractive, but these kernels are memory-bound for many finite-volume and finite-difference methods, which damps speedups. Even when speedups are realized, the lower computation cost of GPU-based kernels makes communication and I/O times a more intrusive share of the total runtime. We present a strategy for GPU acceleration of multiphase compressible flow solvers that addresses these challenges and obtains large speedups at scale. We use OpenACC for directive-based offloading of all compute kernels while maintaining low-level control when needed. An established Fortran preprocessor and metaprogramming tool, Fypp, enables otherwise hidden compile-time optimizations. This strategy exposes compile-time optimizations and high memory reuse while retaining readable, maintainable, and compact code. Remote direct memory access, realized via CUDA-aware MPI and GPUDirect, reduces halo-exchange communication time. We implement this approach in the open-source solver MFC [1]. Metaprogramming results in an 8-times speedup of the most expensive kernels compared to a statically compiled program, reaching 46% of peak FLOPs on modern NVIDIA GPUs and high arithmetic intensity (about 10 FLOPs/byte). In representative simulations, a single NVIDIA A100 GPU is 7 times faster than an Intel Xeon Cascade Lake (6248) CPU die, or about 300 times faster than a single such CPU core. At the same time, near-ideal (97%) weak scaling is observed for at least 13824 GPUs on OLCF Summit. A strong scaling efficiency of 84% is retained for an 8-times increase in GPU count. Collective I/O, implemented via MPI3, helps ensure the negligible contribution of data transfers (<1% of the wall time for a typical, large simulation). Large many-GPU simulations of compressible (solid-)liquid-gas flows demonstrate the practical utility of this strategy.
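The performance figures quoted above (arithmetic intensity, weak and strong scaling efficiency) can be made concrete with a short sketch. The Python below is illustrative only and is not taken from MFC: the function names are hypothetical, the hardware numbers are assumed nominal values for a 40 GB NVIDIA A100 (9.7 TFLOP/s FP64 without tensor cores, 1555 GB/s memory bandwidth), and the timings in the usage lines are invented to show how an efficiency near 84% would arise, not measured data.

```python
def roofline_attainable(peak_flops, mem_bandwidth, intensity):
    """Attainable FLOP/s under the simple roofline model.

    A kernel is memory-bound while mem_bandwidth * intensity < peak_flops.
    """
    return min(peak_flops, mem_bandwidth * intensity)


def weak_scaling_efficiency(t_base, t_scaled):
    """Weak scaling: work per device is fixed, so ideal time is constant."""
    return t_base / t_scaled


def strong_scaling_efficiency(t_base, t_scaled, device_ratio):
    """Strong scaling: total work is fixed, so ideal time drops by device_ratio."""
    return t_base / (device_ratio * t_scaled)


# Assumed nominal hardware numbers (not from the abstract).
A100_PEAK_FP64 = 9.7e12      # FLOP/s
A100_BANDWIDTH = 1.555e12    # bytes/s

# Intensity at which a kernel stops being memory-bound (about 6.2 FLOPs/byte here).
ridge_point = A100_PEAK_FP64 / A100_BANDWIDTH

# At about 10 FLOPs/byte the model's ceiling is the compute peak, not bandwidth.
ceiling_at_10 = roofline_attainable(A100_PEAK_FP64, A100_BANDWIDTH, 10.0)

# Hypothetical timings: 8.0 s on N GPUs, 1.19 s on 8N GPUs for the same problem.
strong_eff = strong_scaling_efficiency(8.0, 1.19, 8)

print(f"ridge point: {ridge_point:.2f} FLOPs/byte")
print(f"ceiling at 10 FLOPs/byte: {ceiling_at_10:.3e} FLOP/s")
print(f"strong scaling efficiency: {strong_eff:.2f}")
```

Under these assumed numbers, a kernel at roughly 10 FLOPs/byte sits to the right of the ridge point, which is consistent with the abstract's emphasis on raising memory reuse so that kernels are no longer limited by bandwidth.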

Scalable Multi-node Fast Fourier Transform on GPUs

Large-scale FFT on GPU clusters

Using GPUs to compute large out-of-card FFTs

Accelerating Fast Fourier Transforms Using Hadoop and CUDA

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code and Large-Scale Performance Test on TH-1A.

Large-Scale Fast Fourier Transform

Fast computation of general Fourier Transforms on GPUS

cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs

MFFT: A GPU Accelerated Highly Efficient Mixed-precision Large-scale FFT Framework

Fast and Scalable FFT-Based GPU-Accelerated Algorithms for Hessian Actions Arising in Linear Inverse Problems Governed by Autonomous Dynamical Systems

Fast hardware-aware matrix-free algorithm for higher-order finite-element discretized matrix multivector products on distributed systems

Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

High performance multi-dimensional (2D/3D) FFT-Shift implementation on Graphics Processing Units (GPUs)

CuMF_SGD: Fast and Scalable Matrix Factorization.

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

FLUPS -- a flexible and performant massively parallel Fourier transform library

Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow

Method for scalable and performant GPU-accelerated simulation of multiphase compressible flow