Abstract: The Fast Fourier Transform (FFT) is frequently invoked in stream processing, e.g., to compute the spectral representation of audio/video frames. In many cases the inputs are sparse, i.e., most of their Fourier coefficients are zero. Many sparse FFT algorithms have been proposed to improve FFT efficiency when inputs are known to be sparse. However, like their "dense" counterparts, existing sparse FFT implementations are input-oblivious: how the algorithms work is not affected by the input values, so the sparse FFT computation on one frame is exactly the same as the computation on the next. This paper improves upon existing sparse FFT algorithms by simultaneously exploiting input sparsity and the similarity between adjacent inputs in stream processing. Our algorithm detects the similarity between input samples and uses it to automatically design and customize sparse filters that lead to better parallelism and performance. More specifically, we develop an efficient heuristic to detect the similarity between the current input and its predecessor in the stream. When the two are found to be similar, we take the novel approach of using the spectral representation of the predecessor to accelerate the sparse FFT computation on the current input. Given a sparse signal that has only $k$ non-zero Fourier coefficients, our algorithm uses sparse approximation, tuning several adaptive filters to efficiently package the non-zero Fourier coefficients into a small number of bins that can then be estimated accurately. As a result, our algorithm runs in time sub-linear in the input size and avoids recursive coefficient estimation, both of which improve parallelism and performance. Furthermore, the new heuristic can detect discontinuities within a stream and resume input adaptation very quickly.
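The idea of packing the non-zero coefficients into a small number of bins can be illustrated with the basic aliasing ("bucketization") step shared by many sparse FFT algorithms. This is a minimal sketch, not the paper's implementation: it omits the adaptive filters, the random spectral permutations used by sFFT-style algorithms, and the input-similarity heuristic, and the helper names `bin_coefficients` and `sparse_recover` are hypothetical. It assumes that the number of bins $B$ divides $N$, that the signal is exactly $k$-sparse, and that no two non-zero frequencies land in the same bin. Subsampling $x$ with stride $L = N/B$ and taking a $B$-point DFT yields, after scaling by $L$, bins equal to $\sum_{f \equiv b \pmod{B}} X_f$; repeating this on the input shifted by one sample multiplies each coefficient by $e^{2\pi i f/N}$, so the phase ratio of the two bins reveals $f$.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT with the convention X[f] = sum_t x[t] * e^{-2*pi*i*f*t/n}."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(X):
    """Inverse of dft(): x[t] = (1/n) * sum_f X[f] * e^{2*pi*i*f*t/n}."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def bin_coefficients(x, B, shift=0):
    """Hash the N Fourier coefficients of x into B bins (hypothetical helper).

    Reads only B samples, x[t*L + shift] with L = N // B. Bin b then equals
    the sum of X[f] * e^{2*pi*i*f*shift/N} over every f with f % B == b.
    """
    N = len(x)
    assert N % B == 0, "sketch assumes B divides N"
    L = N // B
    y = [x[(t * L + shift) % N] for t in range(B)]
    return [L * v for v in dft(y)]  # scale by L so each bin is a plain sum


def sparse_recover(x, B, tol=1e-6):
    """Recover {frequency: coefficient} for an exactly sparse x (hypothetical helper).

    Assumes no two non-zero frequencies share a residue modulo B.
    """
    N = len(x)
    plain = bin_coefficients(x, B)
    shifted = bin_coefficients(x, B, shift=1)
    spectrum = {}
    for b in range(B):
        if abs(plain[b]) < tol:
            continue  # empty bin: no non-zero coefficient hashed here
        # A lone coefficient at frequency f gains phase e^{2*pi*i*f/N} under a 1-sample shift.
        f = round(cmath.phase(shifted[b] / plain[b]) * N / (2 * math.pi)) % N
        if f % B != b:
            continue  # phase inconsistent with the bin: likely a collision, skip
        spectrum[f] = plain[b]
    return spectrum


if __name__ == "__main__":
    N, B = 64, 16
    X = [0j] * N
    X[3], X[20], X[41] = 5 + 0j, -2j, 1.5 + 1j  # residues mod 16: 3, 4, 9
    rec = sparse_recover(idft(X), B)
    print(sorted(rec))  # [3, 20, 41]
```

Only $2B$ input samples are read, which is where the sub-linear cost comes from when $B$ scales with $k$ rather than $N$ (a real implementation would use an FFT rather than the naive $O(B^2)$ DFT here). In practical sparse FFT algorithms, random spectral permutations and filters are used to make bin collisions unlikely and to cope with signals that are only approximately sparse.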
We evaluate our input-adaptive sparse FFT implementation on an Intel i7 CPU and three NVIDIA GPUs: the GeForce GTX480, Tesla C2070, and Tesla C2075. Our algorithm is faster than previous FFT implementations both in theory and in practice. For inputs of size $N = 2^{24}$, our parallel implementation outperforms FFTW for $k$ up to $2^{18}$, which is an order of magnitude higher than for prior sparse FFT algorithms. Furthermore, our input-adaptive sparse FFT on the Tesla C2075 GPU achieves speedups of up to 77.2x and 29.3x over 1-thread and 4-thread FFTW, respectively; 10.7x, 6.4x, and 5.2x over sFFT 1.0, sFFT 2.0, and CUFFT, respectively; and 6.9x over our sequential CPU implementation.

Input-adaptive Parallel Sparse Fast Fourier Transform for Stream Processing
