Abstract: The optimization of Fast Fourier Transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT generally achieve much better performance than their CPU counterparts, in part because the data transfer between CPU and GPU is usually not counted in their reported performance. This high performance, however, is limited by the GPU memory size. As the FFT problem size increases, the data transfer between system memory and GPU memory can make up a substantial part of the overall execution time. Optimizations for FFT problems that outgrow the GPU memory therefore cannot avoid tuning the data transfer between CPU and GPU, yet no prior study has addressed this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs that reside in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL, with both CPU libraries running multiple threads on a four-core Intel i7 CPU. The speedups on a Tesla C2070 are 1.93x over FFTW and 2.11x over MKL. A peak performance of 21 GFLOPS is achieved for a double-precision 2D FFT of size 2048x65536 on the C2070.
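The abstract does not spell out the decomposition algorithms, so the following is only a minimal sketch of the general idea behind them, assuming a textbook Cooley-Tukey (four-step) factorization rather than the paper's own method. It splits one length n1*n2 transform into two passes of small batched transforms, which is what lets each batch be sized to fit on the card. The names `dft` and `decomposed_fft` are hypothetical, and the naive `dft` merely stands in for an on-card FFT of one subarray.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for an on-card FFT of one small subarray."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def decomposed_fft(x, n1, n2):
    """Compute a length n1*n2 DFT as two passes of small batched DFTs.

    Input index j = j1*n2 + j2, output index k = k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n

    # Pass 1: n2 strided sub-DFTs of length n1 (one batch per stride offset).
    cols = [dft(x[j2::n2]) for j2 in range(n2)]  # cols[j2][k1]

    # Twiddle step: multiply by exp(-2*pi*i * j2*k1 / n) between the passes.
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Pass 2: n1 sub-DFTs of length n2, taken across the pass-1 results.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])  # row[k2]
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i % 5, -i) for i in range(12)]
    direct = dft(signal)
    split = decomposed_fft(signal, 3, 4)
    print(max(abs(a - b) for a, b in zip(direct, split)))
```

In an out-of-card setting, each pass would correspond to transferring a batch of subarrays over the PCI bus, transforming them on the GPU, and copying the results back. The strided access in pass 1 hints at why data locality and transfer blocking matter for the effective bandwidth.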

On the Use of Small 2D Convolutions on GPUs

Accelerated 2D Image Processing on GPUs

Fast and High-Resolution Acoustic Beamforming: A Convolution Accelerated Deconvolution Implementation

Revisiting Convolution and FFT on Parallel Computation Platforms

Fast Fourier transforms for the evaluation of convolution products: CPU versus GPU implementation

High performance multi-dimensional (2D/3D) FFT-Shift implementation on Graphics Processing Units (GPUs)

Using GPUs to compute large out-of-card FFTs

Efficient FFT mapping on GPU for radar processing application: modeling and implementation

High Performance Implementation of 3D Convolutional Neural Networks on a GPU

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

GPU Acceleration of Image Convolution using Spatially-varying Kernel

A Parallel Implementation of the 2D Wavelet Transform Using CUDA

Research on the fast Fourier transform of image based on GPU

Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures

Accelerating Fast Fourier Transforms Using Hadoop and CUDA

Performant low-order matrix-free finite element kernels on GPU architectures

Techniques For Efficient Dct/Idct Implementation On Generic Gpu

A High Utilization FPGA-Based Accelerator for Variable-Scale Convolutional Neural Network

GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory

Fast Algorithms and Efficient GPU Implementations for the Radon Transform and the Back-Projection Operator Represented as Convolution Operators