Large-Scale Fast Fourier Transform

Yifeng Chen, Xiang Cui, Hong Mei
DOI: https://doi.org/10.1016/b978-0-12-384988-5.00039-5
2011-01-01
Abstract: This chapter shows how to achieve substantial speedups on large-scale fast Fourier transforms (FFTs). Because they lack data locality, these FFTs are hard to accelerate on GPU clusters: the bottleneck often lies with the PCI bus or the communication network, so optimizing FFT for a single GPU device does not improve the overall performance. A GPU cluster is a network-connected workstation cluster with one or more GPU devices on each node. Computation-intensive tasks, such as dense matrix multiplication and Linpack, are often easy to accelerate with GPUs. Most existing codes assume FFTs to have a “small scale” so that the entire user data can be held in one GPU's device memory. This fits an application scenario in which FFT is performed repeatedly on the same data, so that the overhead of transferring the source data to and from host memory is outweighed by the computation time. This chapter considers a “large-scale” FFT whose dataset is too large to fit in one GPU's device memory and requires multiple GPU nodes. Three GPU-related factors contribute to better performance: first, the use of GPU devices improves the sustained memory bandwidth for processing large data; second, GPU device memory allows larger subtasks to be processed whole and hence reduces repeated data transfers between memory and processors; and finally, some costly main-memory operations such as matrix transposition can be significantly sped up by GPUs if the necessary data adjustment is performed during data transfers. The technique of manipulating array dimensions during data transfer is the main technical contribution.
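To make the role of matrix transposition concrete, the following is a minimal sketch of the standard "four-step" decomposition, which splits one long FFT into many short ones separated by a transpose. It is not the chapter's GPU implementation: the function names `dft` and `four_step_fft` are hypothetical, the short transforms use a naive DFT for brevity, and everything runs in host memory. In a GPU-cluster setting, the transpose step is the kind of data rearrangement that could instead be folded into the transfers between memories.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, standing in for a small FFT that fits on one device."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """DFT of length n1*n2 computed as short DFTs around one transpose.

    Input index is split as j = j1 + n1*j2, output index as k = n2*k1 + k2.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: n1 independent DFTs of length n2 over the strided rows x[j1::n1].
    rows = [dft(x[j1::n1]) for j1 in range(n1)]  # rows[j1][k2]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i*j1*k2/n).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Step 3: transpose, so that each length-n1 sequence becomes contiguous.
    cols = [list(c) for c in zip(*rows)]  # cols[k2][j1]

    # Step 4: n2 independent DFTs of length n1.
    cols = [dft(c) for c in cols]  # cols[k2][k1]

    # Scatter the results back to natural output order k = n2*k1 + k2.
    out = [0j] * n
    for k2 in range(n2):
        for k1 in range(n1):
            out[n2 * k1 + k2] = cols[k2][k1]
    return out
```

Calling `four_step_fft(x, 4, 8)` on a 32-element list should agree with `dft(x)` up to floating-point rounding. Steps 1 and 4 touch only short, independent sequences, while step 3 moves every element, which is why the transpose tends to dominate when the data do not fit in one memory.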