Abstract: The optimization of Fast Fourier Transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT generally achieve much better performance than their CPU counterparts, since the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. As the FFT problem size increases, the data transfer between system memory and GPU memory can account for a substantial part of the overall execution time. Optimizations for FFT problems that outgrow GPU memory therefore cannot bypass the tuning of data transfer between CPU and GPU. However, no prior study has attacked this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs that reside in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D, and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL, both running with multiple threads on a four-core Intel i7 CPU. The speedups on a Tesla C2070 are 1.93x over FFTW and 2.11x over MKL. A peak performance of 21 GFLOPS is achieved for a double-precision 2D FFT of size 2048x65536 on the C2070.
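To make the idea of FFT decomposition concrete, the following is a minimal sketch of the textbook Cooley-Tukey factorization, in which a transform of length n1*n2 is computed from n2 sub-transforms of length n1, a twiddle-factor multiplication, and n1 sub-transforms of length n2. It is not the decomposition algorithm proposed in the paper, whose details the abstract does not give. The function names `dft` and `decomposed_fft` are hypothetical, and the naive `dft` merely stands in for an on-card FFT applied to a subarray small enough to fit in GPU memory.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for an on-card FFT of one small subarray."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def decomposed_fft(x, n1, n2):
    """Length n1*n2 DFT built from smaller sub-DFTs (Cooley-Tukey factorization).

    Input index is split as j = j1*n2 + j2, output index as k = k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n, "input length must equal n1 * n2"

    # Step 1: n2 independent sub-DFTs of length n1 over the strided
    # subarrays x[j2], x[j2 + n2], x[j2 + 2*n2], ...
    cols = [dft(x[j2::n2]) for j2 in range(n2)]  # cols[j2][k1]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i * j2*k1 / n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: n1 independent sub-DFTs of length n2, one for each k1,
    # scattered to output position k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])  # row[k2]
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i, -0.5 * i) for i in range(12)]
    direct = dft(signal)
    split = decomposed_fft(signal, 3, 4)
    print(max(abs(a - b) for a, b in zip(direct, split)))
```

Each sub-transform in steps 1 and 3 touches only n1 or n2 elements, so in an out-of-card setting a batch of them could be transferred to the GPU, transformed, and copied back independently. How those batches are laid out and buffered for transfer over the PCI bus is the part this sketch leaves out.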

Improving Performance of Matrix Multiplication and FFT on GPU

Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

Performance Modeling and Optimization of Sparse Matrix-Vector Multiplication on NVIDIA CUDA Platform

Large-scale FFT on GPU clusters

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

High Performance Matrix Multiplication on Many Cores

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Optimizing sparse general matrix–matrix multiplication for DCUs

Using GPUs to compute large out-of-card FFTs

Optimizing sparse matrix-vector multiplication based on gpu

Improvement of Sparse Matrix-Vector Multiplication on GPU

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

MFFT: A GPU Accelerated Highly Efficient Mixed-precision Large-scale FFT Framework

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

Acceleration of Tensor-Product Operations with Tensor Cores

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

High Performance Computing Via a GPU

Parallel optimization for sparse matrix-vector on GPU