Abstract: The fast Fourier transform (FFT) is an important component of a processor's basic software ecosystem and is widely used in engineering, science, physics, and mathematics. The performance these applications demand of FFT keeps rising, so it is worthwhile to study high-performance implementations of the FFT algorithm, in particular large-radix FFT on ARMv8 and X86-64. Based on the architectural features of the ARMv8 and X86-64 computing platforms, this paper studies high-performance implementation and optimization methods for the FFT algorithm. By applying butterfly network optimization, reduction of the number of network stages through large radices, large-radix butterfly computation optimization, SIMD (single instruction, multiple data) assembly optimization, and register usage optimization, the paper improves the performance of the FFT algorithm, considerably improves the performance of large-radix FFT, and resolves the performance bottleneck caused by insufficient register resources. Finally, it summarizes a set of high-performance implementation strategies and optimization solutions for the Cooley-Tukey FFT algorithm. The experimental results indicate that, on ARM and X86-64 processors, the implemented FFT algorithm achieves a significant performance improvement over ARMPL (ARM performance library), Intel MKL (math kernel library), and FFTW (fastest Fourier transform in the West), and that the large-radix implementation significantly outperforms small and medium radices.
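To make the link between radix and network stages concrete, the following is a minimal Python sketch of a fixed-radix, decimation-in-time Cooley-Tukey FFT. It is an illustration only and not the paper's implementation: the function names `dft`, `fft`, and `stages` are hypothetical, and the SIMD assembly and register-level optimizations named in the abstract are not modeled. It shows only that a larger radix yields fewer butterfly stages, each built from a larger butterfly.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used as a correctness reference and as a fallback."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x, radix=2):
    """Recursive decimation-in-time Cooley-Tukey FFT with a fixed radix.

    Hypothetical helper for illustration; assumes len(x) is a power of `radix`
    (otherwise it falls back to the naive DFT).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % radix != 0:
        return dft(x)
    m = n // radix
    # Split the input into `radix` interleaved subsequences and transform each.
    subs = [fft(x[r::radix], radix) for r in range(radix)]
    out = [0j] * n
    for k in range(m):
        # Multiply each sub-transform output by its twiddle factor.
        t = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n)
             for r in range(radix)]
        # Radix-point butterfly: a small DFT of size `radix` over the
        # twiddled values.
        for q in range(radix):
            out[k + q * m] = sum(
                t[r] * cmath.exp(-2j * cmath.pi * r * q / radix)
                for r in range(radix))
    return out

def stages(n, radix):
    """Number of butterfly stages for a length-n transform (n a power of radix)."""
    count = 0
    while n > 1:
        n //= radix
        count += 1
    return count

if __name__ == "__main__":
    for r in (2, 4, 8, 16):
        print(f"radix {r:2d}: {stages(4096, r)} stages for N = 4096")
```

Under these assumptions, a 4096-point transform needs 12 stages at radix 2 but only 3 at radix 16. Each radix-16 butterfly keeps far more intermediate values live at once, which is consistent with the register pressure the abstract identifies as a bottleneck.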

Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs

NUMA-aware FFT-based Convolution on ARMv8 Many-core CPUs

Optimizing One by One Direct Convolution on ARMv8 Multi-core CPUs

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability.

Optimizing Winograd Convolution on ARMv8 processors

Running 8-Bit Dynamic Fixed-Point Convolutional Neural Network on Low-Cost ARM Platforms

Optimizing Half Precision Winograd Convolution on ARM Many-Core Processors

Optimizing Direct Convolutions on ARM Multi-Cores

Efficient Convolution Architectures for Convolutional Neural Network

Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures.

Efficient Convolutional Neural Networks Utilizing Fine-Grained Fast Fourier Transforms

Efficient Hardware Architectures for Deep Convolutional Neural Network

Efficient Reconfigurable Hardware Core for Convolutional Neural Networks.

Parallelizing Convolutional Neural Networks On Intel (R) Many Integrated Core Architecture

Optimization of Convolution Neural Network Algorithm Based on FPGA

High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm

ABM-SpConv-SIMD: Accelerating Convolutional Neural Network Inference for Industrial IoT Applications on Edge Devices

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Acceleration Performance Study of Convolutional Neural Network Based on Split-radix-2/(2a) FFT Algorithms

Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning