Abstract: The Fast Fourier Transform (FFT) is frequently invoked in stream processing, e.g., to compute the spectral representation of audio/video frames, and in many cases the inputs are sparse, i.e., most of their Fourier coefficients are zero. Many sparse FFT algorithms have been proposed to improve FFT efficiency when inputs are known to be sparse. However, like their "dense" counterparts, existing sparse FFT implementations are input-oblivious: how the algorithm works is not affected by the values of the input, so the sparse FFT computation on one frame is exactly the same as the computation on the next. This paper improves upon existing sparse FFT algorithms by simultaneously exploiting input sparsity and the similarity between adjacent inputs in stream processing. Our algorithm detects and takes advantage of the similarity between input samples to automatically design and customize sparse filters that lead to better parallelism and performance. More specifically, we develop an efficient heuristic to detect the similarity between the current input and its predecessor in the stream; when the two are found to be similar, we use the spectral representation of the predecessor to accelerate the sparse FFT computation on the current input. Given a sparse signal that has only $k$ non-zero Fourier coefficients, our algorithm uses sparse approximation, tuning several adaptive filters to efficiently package the non-zero Fourier coefficients into a small number of bins, which can then be estimated accurately. As a result, our algorithm has runtime sublinear in the input size and eliminates recursive coefficient estimation, both of which improve parallelism and performance. Furthermore, the new heuristic can detect discontinuities inside the stream and resume input adaptation very quickly.
We evaluate our input-adaptive sparse FFT implementation on an Intel i7 CPU and three NVIDIA GPUs: the GeForce GTX480, Tesla C2070, and Tesla C2075. Our algorithm is faster than previous FFT implementations both in theory and in implementation. For inputs of size $N=2^{24}$, our parallel implementation outperforms FFTW for $k$ up to $2^{18}$, which is an order of magnitude higher than prior sparse algorithms. Furthermore, our input-adaptive sparse FFT on the Tesla C2075 GPU achieves speedups of up to 77.2x and 29.3x over 1-thread and 4-thread FFTW, respectively; 10.7x, 6.4x, and 5.2x over sFFT 1.0, sFFT 2.0, and CUFFT, respectively; and 6.9x over our sequential CPU implementation.
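
To make the idea of reusing a predecessor's spectrum concrete, the following is a minimal sketch and not the paper's filter design. It assumes a signal length that is a power of two, an exactly sparse spectrum, and plain time-domain subsampling (which aliases the spectrum into a small number of bins) in place of the tuned adaptive filters described above. The function names `choose_bins`, `estimate_with_known_support`, and `matches_probes` are hypothetical, introduced only for illustration.

```python
import cmath


def synthesize(coeffs, n):
    """Inverse DFT of a sparse spectrum given as {frequency: coefficient}."""
    return [sum(c * cmath.exp(2j * cmath.pi * f * t / n) for f, c in coeffs.items()) / n
            for t in range(n)]


def dft(samples):
    """Naive DFT; acceptable here because it only runs on the small bin count."""
    b = len(samples)
    return [sum(s * cmath.exp(-2j * cmath.pi * j * m / b) for m, s in enumerate(samples))
            for j in range(b)]


def choose_bins(support, n):
    """Smallest power-of-two bin count that puts every known frequency in its own bin."""
    if n <= 0 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    b = 1
    while b < len(support):
        b *= 2
    while len({f % b for f in support}) != len(support):
        b *= 2  # terminates by b == n, where distinct frequencies never collide
    return b


def estimate_with_known_support(x, support):
    """Estimate coefficients at the predecessor's frequencies from only b samples.

    Subsampling x by n // b aliases frequency f into bin f % b, scaled by b / n,
    so each collision-free bin holds exactly one coefficient.
    """
    n = len(x)
    b = choose_bins(support, n)
    bins = dft(x[::n // b])
    return {f: (n / b) * bins[f % b] for f in support}


def matches_probes(x, coeffs, probes, tol=1e-6):
    """Cheap similarity check: does the sparse estimate reproduce a few time samples?

    Probes should lie off the subsampling grid, since aliased errors vanish on it.
    """
    n = len(x)
    for t in probes:
        approx = sum(c * cmath.exp(2j * cmath.pi * f * t / n) for f, c in coeffs.items()) / n
        if abs(approx - x[t]) > tol:
            return False
    return True


if __name__ == "__main__":
    previous = {3: 10 + 0j, 17: 4 - 2j, 40: 6j}
    frame = synthesize(previous, 64)
    estimate = estimate_with_known_support(frame, list(previous))
    print(matches_probes(frame, estimate, probes=[1, 5, 22]))
```

When the probe check fails, the frame is treated as a discontinuity and a caller would fall back to a full (sparse or dense) FFT before resuming adaptation; that fallback is outside this sketch.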