MFFT: A GPU Accelerated Highly Efficient Mixed-precision Large-scale FFT Framework

Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang
DOI: https://doi.org/10.1145/3605148
IF: 1.444
2023-06-19
ACM Transactions on Architecture and Code Optimization
Abstract: The fast Fourier transform (FFT) is widely used in large-scale parallel programs, where data communication is the main performance bottleneck and seriously affects parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique that adopts a “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. On top of the mixed-precision MFFT framework, we apply several optimization techniques to improve performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4096 GPUs. The results show that shared-exponent MFFT is 1.23x faster than double-precision MFFT on average, and that double-precision MFFT achieves on average 3.53x and 9.48x higher performance than the open-source libraries 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. Compared with 2Decomp&FFT, double-precision MFFT increases the parallel efficiency from 53.2% to 78.1%, and shared-exponent MFFT further increases it to 83.8%.
computer science, theory & methods, hardware & architecture
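The abstract does not specify the shared-exponent format, so the following is only a minimal sketch of the general idea (generic block floating point), not the paper's actual scheme. It assumes that each block of double-precision values is sent as one shared exponent plus one 32-bit signed integer per value, which roughly halves the message size. The function names, the block layout, and the `mantissa_bits` parameter are hypothetical and chosen for illustration.

```python
import math
import struct


def compress_block(values, mantissa_bits=31):
    """Quantize a block of floats to signed integers that share one exponent.

    Assumption: the shared exponent is taken from the largest magnitude in
    the block, so smaller values lose low-order bits.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    scale = mantissa_bits - shared_exp
    limit = (1 << mantissa_bits) - 1  # keep every value in signed range
    mantissas = []
    for v in values:
        q = int(round(math.ldexp(v, scale)))
        mantissas.append(max(-limit, min(limit, q)))
    return shared_exp, mantissas


def decompress_block(shared_exp, mantissas, mantissa_bits=31):
    """Rebuild approximate doubles from the shared exponent and integers."""
    scale = shared_exp - mantissa_bits
    return [math.ldexp(float(m), scale) for m in mantissas]


def pack_block(shared_exp, mantissas):
    """Serialize as one int32 exponent followed by int32 mantissas."""
    return struct.pack(f"<i{len(mantissas)}i", shared_exp, *mantissas)
```

Under these assumptions, a block of `n` doubles (`8 * n` bytes) is packed into `4 + 4 * n` bytes, and the absolute error of each restored value is below `2 ** (shared_exp - mantissa_bits)`. A real implementation would run on the GPU and would have to choose the block size and exponent width to balance accuracy against communication volume.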