Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs

Zhihao Li, Haipeng Jia, Yunquan Zhang, Tun Chen, Liang Yuan, Richard Vuduc
DOI: https://doi.org/10.1109/tpds.2020.2977629
IF: 5.3
2020-08-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: This article presents AutoFFT, a template-based code generation framework that can automatically generate high-performance FFT kernels for all natural-number radices. AutoFFT uses the Cooley-Tukey FFT algorithm, which exploits the symmetric and periodic properties of the DFT matrix, as its outer parallelization framework. Because butterflies are the core operations of the Cooley-Tukey algorithm, we explore additional symmetric and periodic properties of the DFT matrix and formulate multiple optimized calculation templates that further reduce the number of floating-point operations for butterflies of arbitrary natural-number radices. To fully exploit hardware resources, we encapsulate a series of optimizations in an assembly template optimizer. Given any DFT problem, AutoFFT automatically generates C FFT kernels using these calculation templates and converts them into efficient assembly kernels using the template optimizer. Through a series of experiments on Arm, Intel, and AMD processors, we show that AutoFFT-generated kernels can outperform those in the Fastest Fourier Transform in the West (FFTW), the Arm Performance Libraries (ARMPL), and the Intel Math Kernel Library (MKL).
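To illustrate the kind of symmetry the abstract refers to, the following is a minimal Python sketch of an odd-radix butterfly that pairs outputs k and r-k so they share the same cosine and sine partial sums. It is not AutoFFT's actual calculation template or generated code; the function names `naive_dft` and `odd_radix_butterfly` are hypothetical and chosen only for this example.

```python
import cmath
import math


def naive_dft(x):
    """Reference DFT: direct multiplication by the r x r DFT matrix."""
    r = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / r) for n in range(r))
        for k in range(r)
    ]


def odd_radix_butterfly(x):
    """Odd-radix butterfly using the conjugate symmetry of the DFT matrix.

    Illustrative sketch only (an assumption, not the paper's template).
    Because exp(-2*pi*i*(r-n)*k/r) is the conjugate of exp(-2*pi*i*n*k/r),
    inputs n and r-n can be combined first, and outputs k and r-k then
    reuse the same two partial sums.
    """
    r = len(x)
    if r % 2 == 0:
        raise ValueError("this sketch handles odd radices only")
    half = (r - 1) // 2

    # Pre-combine mirrored inputs once; every output pair reuses them.
    sums = [x[n] + x[r - n] for n in range(1, half + 1)]
    diffs = [x[n] - x[r - n] for n in range(1, half + 1)]

    out = [0j] * r
    out[0] = x[0] + sum(sums)
    for k in range(1, half + 1):
        a = x[0]   # cosine part, shared by outputs k and r-k
        b = 0j     # sine part, shared up to sign
        for n in range(1, half + 1):
            angle = 2.0 * math.pi * n * k / r
            a += sums[n - 1] * math.cos(angle)
            b += diffs[n - 1] * math.sin(angle)
        out[k] = a - 1j * b
        out[r - k] = a + 1j * b
    return out


if __name__ == "__main__":
    data = [1 + 2j, -0.5j, 3.0, 2 - 1j, 0.25 + 0.75j]
    for fast, ref in zip(odd_radix_butterfly(data), naive_dft(data)):
        print(fast, ref)
```

In this sketch the inner loop uses real twiddle values (one cosine and one sine per term) instead of a full complex multiplication per matrix entry, and each pair of outputs is produced from one pass over the combined inputs, which is the general flavor of operation-count reduction the abstract describes.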
Computer Science, Theory & Methods; Engineering, Electrical & Electronic