M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures

Srinath Kailasa, Timo Betcke, Sarah El Kazdadi
2024-08-14
Abstract: Current and future trends in computer hardware, in which the disparity between available flops and memory bandwidth continues to grow, favour algorithm implementations which minimise data movement even at the cost of more flops. In this study we review the requirements for high performance implementations of the kernel independent Fast Multipole Method (kiFMM), a variant of the crucial FMM algorithm for the rapid evaluation of N-body potential problems. Performant implementations of the kiFMM typically rely on Fast Fourier Transforms for the crucial M2L (Multipole-to-Local) operation. However, in recent years BLAS based M2L translation operators that rely on direct matrix compression techniques have also become popular for other FMM variants, such as the black-box FMM. In this paper we present algorithmic improvements for BLAS based M2L translation operators and benchmark them against FFT based M2L translation operators. In order to allow a fair comparison we have implemented our own high-performance kiFMM algorithm in Rust that performs competitively against other implementations, and allows us to flexibly switch between BLAS and FFT based translation operators.
Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
On modern computer architectures the gap between available floating-point throughput (flops) and memory bandwidth continues to widen. This paper asks how the key operation of the Fast Multipole Method (FMM), the Multipole-to-Local (M2L) translation, can be optimized under these conditions to reduce data movement and improve performance. Specifically, the paper compares BLAS-based and FFT-based M2L translation operators in the kernel-independent Fast Multipole Method (kiFMM), and explores the feasibility of BLAS-M2L for three-dimensional problems.

### Detailed Explanation

1. **Background Problems**:
   - Modern hardware trends favour algorithm implementations that minimize data movement, even if this requires more floating-point operations.
   - The Fast Multipole Method (FMM) accelerates the evaluation of N-body potential problems. It reduces the cost of evaluating the interactions between two sets of N and M points from O(NM) to O(P(N + M)), where P is a small constant that controls the precision.
   - In the kiFMM, the M2L operation usually relies on the Fast Fourier Transform (FFT). In recent years, BLAS-based M2L translation operators, which rely on direct matrix compression techniques, have also become popular.
2. **Research Objectives**:
   - Evaluate the performance of the BLAS-based M2L translation operator in three-dimensional problems, especially for the Laplace kernel.
   - Compare the performance of BLAS-M2L and FFT-M2L in the kiFMM, especially on modern processor architectures.
3. **Main Contributions**:
   - Propose algorithmic improvements to the BLAS-M2L translation operator that increase cache reuse and maximize arithmetic intensity.
   - Conduct extensive benchmarks with a custom high-performance kiFMM implementation, comparing FFT-M2L and BLAS-M2L for the three-dimensional Laplace kernel.
4. **Research Methods**:
   - Develop a high-performance kiFMM implementation in Rust that allows flexible switching between BLAS-based and FFT-based translation operators.
   - Evaluate the two translation operators through single-node benchmarks and analyze the results.

### Conclusion

This paper explores the competitiveness of BLAS-M2L relative to FFT-M2L on modern computer architectures by optimizing the BLAS-M2L translation operator, especially for three-dimensional problems. The results indicate that, with appropriate optimization, BLAS-M2L can achieve performance comparable to, and in some cases better than, FFT-M2L.
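
As a point of reference for the O(NM) cost mentioned in the background, the following is a minimal sketch of the direct summation that the FMM is designed to avoid. It is not taken from the paper's implementation. The function name `direct_laplace_potentials`, the array layout, and the 1/(4πr) normalisation of the three-dimensional Laplace kernel are assumptions made for illustration.

```rust
use std::f64::consts::PI;

/// Direct evaluation of the 3D Laplace potential
///   phi(x_i) = sum_j q_j / (4 * pi * |x_i - y_j|)
/// at every target x_i due to every source y_j with charge q_j.
///
/// Hypothetical helper for illustration only. Every target-source pair is
/// visited once, so the cost grows with the product of the two point counts.
pub fn direct_laplace_potentials(
    targets: &[[f64; 3]],
    sources: &[[f64; 3]],
    charges: &[f64],
) -> Vec<f64> {
    assert_eq!(sources.len(), charges.len(), "one charge per source");
    let scale = 1.0 / (4.0 * PI);

    targets
        .iter()
        .map(|t| {
            let mut phi = 0.0;
            for (s, &q) in sources.iter().zip(charges) {
                let dx = t[0] - s[0];
                let dy = t[1] - s[1];
                let dz = t[2] - s[2];
                let r2 = dx * dx + dy * dy + dz * dz;
                // Skip coincident points to avoid the singularity at r = 0.
                if r2 > 0.0 {
                    phi += q / r2.sqrt();
                }
            }
            scale * phi
        })
        .collect()
}
```

The doubly nested loop is what makes this approach impractical for large point sets. An FMM replaces the far-field part of this sum with translation operators such as the M2L, and keeps direct summation only for nearby points.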