Abstract: The optimization of Fast Fourier Transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT can generally achieve much better performance than their counterparts on a CPU, in part because the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. When the FFT problem size increases, the data transfer between system and GPU memory can comprise a substantial part of the overall execution time. Therefore, optimizations for FFT problems that outgrow the GPU memory cannot bypass the tuning of data transfer between CPU and GPU. However, no prior study has attacked this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs that reside in the CPU memory of a single compute node. In this paper, the performance of the PCI bus during the transfer of a batch of FFT subarrays is studied, and a blocked buffer algorithm is proposed to improve the effective bandwidth. More importantly, several FFT decomposition algorithms are proposed to increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedups on a Tesla C2070 are 1.93x over FFTW and 2.11x over MKL. A peak performance of 21 GFLOPS is achieved for a double-precision 2D FFT of size 2048x65536 on the C2070.
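The idea of decomposing an FFT that is too large for the card into batches of smaller FFTs can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses the generic Cooley-Tukey "four-step" factorization N = n1 * n2, and the names `dft` and `four_step_fft` are hypothetical. The naive `dft` only stands in for an on-card FFT of a chunk small enough to fit in GPU memory, and no GPU, PCI transfer, or blocked buffer is modeled.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for an on-card FFT of a small chunk."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT assembled from batches of length-n1 and length-n2 DFTs.

    Input index:  j = j1 * n2 + j2
    Output index: k = k1 + n1 * k2
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: n2 independent length-n1 transforms over strided subarrays.
    # Each subarray is the kind of batch that would be shipped to the card.
    cols = [dft([x[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i * j2 * k1 / n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: n1 independent length-n2 transforms, one per k1.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

Calling `four_step_fft(x, 3, 4)` on a 12-element list should agree with `dft(x)` up to rounding error. In an out-of-card setting, the choice of n1 and n2 would determine how many subarrays are transferred per batch and how strided the host-memory accesses are, which is the kind of trade-off the abstract refers to.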

Large-Scale Fast Fourier Transform

Large-scale FFT on GPU clusters

Using GPUs to compute large out-of-card FFTs

MFFT: A GPU Accelerated Highly Efficient Mixed-precision Large-scale FFT Framework

A GPU Based Memory Optimized Parallel Method For FFT Implementation

Scalable Multi-node Fast Fourier Transform on GPUs

Accelerating Fast Fourier Transforms Using Hadoop and CUDA

Performance Enhancement of GPU Parallel Computing Using Memory Allocation Optimization

Fast computation of general Fourier Transforms on GPUS

Fast Fourier transforms for the evaluation of convolution products: CPU versus GPU implementation

cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs

Research on the fast Fourier transform of image based on GPU

High performance multi-dimensional (2D/3D) FFT-Shift implementation on Graphics Processing Units (GPUs)

Efficient FFT mapping on GPU for radar processing application: modeling and implementation

HI-FFT: Heterogeneous Parallel In-Place Algorithm for Large-Scale 2D-FFT

Large-Scale Discrete Fourier Transform on TPUs

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

tcFFT: Accelerating Half-Precision FFT through Tensor Cores