Scalable Multi-node Fast Fourier Transform on GPUs

Manthan Verma, Soumyadeep Chatterjee, Gaurav Garg, Bharatkumar Sharma, Nishant Arya, Sashi Kumar, Anish Saxena, Mahendra K. Verma
DOI: https://doi.org/10.1007/s42979-023-02109-0
2023-08-20
SN Computer Science
Abstract: In this paper, we present the details of our multi-node GPU-FFT library, as well as its scaling on the Selene HPC system. It is one of the first attempts to develop an object-oriented, open-source, multi-node multi-GPU FFT library by combining cuFFT, CUDA, and MPI. Our library employs slab decomposition for data division and CUDA-aware MPI for communication among GPUs. To minimize communication overheads, we employ a combination of asynchronous MPI_Isend and MPI_Irecv, along with MPI_Waitall and cudaMemcpy, instead of using MPI_Alltoall. We conducted a scaling analysis of our GPU-FFT library for three grid sizes, utilizing up to 512 A100 GPUs. We achieved linear scaling for one of these grids when using 64 to 512 GPUs. We report that the timings of a multicore FFT on a grid using 196608 cores of a Cray XC40 are comparable to those of GPU-FFT on a grid using 128 GPUs. The efficiency of GPU-FFT is due to the fast computation capabilities of the A100 card and efficient communication via NVLink.
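The slab decomposition and pairwise exchange described in the abstract can be illustrated with a small single-process simulation. The sketch below is not the library's code and uses no CUDA or MPI: it works in 2D rather than 3D for brevity, a naive DFT stands in for cuFFT, and a list of "inboxes" stands in for the MPI_Isend / MPI_Irecv / MPI_Waitall exchange. The function names (`dft`, `slab_fft2d`, `direct_dft2d`) are hypothetical and chosen only for this illustration.

```python
import cmath


def dft(values):
    """Naive O(n^2) 1D discrete Fourier transform (stand-in for a cuFFT call)."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def slab_fft2d(grid, num_ranks):
    """2D DFT of an n x n grid split into row slabs across simulated ranks."""
    n = len(grid)
    assert n % num_ranks == 0, "grid size must be divisible by the rank count"
    rows_per_rank = n // num_ranks

    # Slab decomposition: rank r owns rows [r * rows_per_rank, (r + 1) * rows_per_rank).
    slabs = [grid[r * rows_per_rank:(r + 1) * rows_per_rank] for r in range(num_ranks)]

    # Step 1: each rank transforms along the axis it holds completely.
    slabs = [[dft(row) for row in slab] for slab in slabs]

    # Step 2: pairwise exchange. Each source rank prepares one block per
    # destination (the role played by MPI_Isend); each destination gathers
    # one block per source (the role played by MPI_Irecv and MPI_Waitall).
    inbox = [[None] * num_ranks for _ in range(num_ranks)]
    for src in range(num_ranks):
        for dst in range(num_ranks):
            cols = range(dst * rows_per_rank, (dst + 1) * rows_per_rank)
            inbox[dst][src] = [[row[c] for c in cols] for row in slabs[src]]

    # Step 3: each rank assembles full columns from the received blocks
    # and transforms along the remaining axis.
    result = [[0j] * n for _ in range(n)]
    for dst in range(num_ranks):
        for local_c in range(rows_per_rank):
            column = [
                inbox[dst][src][local_r][local_c]
                for src in range(num_ranks)
                for local_r in range(rows_per_rank)
            ]
            c = dst * rows_per_rank + local_c
            for k, value in enumerate(dft(column)):
                result[k][c] = value
    return result


def direct_dft2d(grid):
    """Reference 2D DFT computed directly from the definition."""
    n = len(grid)
    return [
        [
            sum(
                grid[x][y] * cmath.exp(-2j * cmath.pi * (kx * x + ky * y) / n)
                for x in range(n)
                for y in range(n)
            )
            for ky in range(n)
        ]
        for kx in range(n)
    ]
```

Calling `slab_fft2d(grid, num_ranks)` on a small grid and comparing against `direct_dft2d(grid)` shows that the local-transform, exchange, local-transform sequence reproduces the full transform. In a real multi-GPU setting, Step 2 is where the communication cost arises, which is the part the abstract says is handled with asynchronous point-to-point calls rather than MPI_Alltoall.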