Abstract: Current and future trends in computer hardware, in which the disparity between available flops and memory bandwidth continues to grow, favour algorithm implementations which minimise data movement even at the cost of more flops. In this study we review the requirements for high performance implementations of the kernel independent Fast Multipole Method (kiFMM), a variant of the crucial FMM algorithm for the rapid evaluation of N-body potential problems. Performant implementations of the kiFMM typically rely on Fast Fourier Transforms for the crucial M2L (Multipole-to-Local) operation. However, in recent years BLAS based M2L translation operators that rely on direct matrix compression techniques have also become popular for other FMM variants such as the black-box FMM. In this paper we present algorithmic improvements for BLAS based M2L translation operators and benchmark them against FFT based M2L translation operators. In order to allow a fair comparison we have implemented our own high-performance kiFMM algorithm in Rust that performs competitively against other implementations, and allows us to flexibly switch between BLAS and FFT based translation operators.
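For context, the N-body potential problem that the FMM accelerates can be evaluated directly with a double loop over targets and sources. The following is a minimal Rust sketch of that baseline for the 3D Laplace kernel, not code from the paper: the function name `direct_laplace_potential`, the 1/(4πr) normalisation, and the choice to skip coincident points are all assumptions made for illustration.

```rust
/// Direct evaluation of the 3D Laplace potential at every target:
/// phi(x_i) = sum_j q_j / (4 * pi * |x_i - y_j|).
/// The cost is proportional to targets.len() * sources.len().
/// Hypothetical helper for illustration only.
pub fn direct_laplace_potential(
    targets: &[[f64; 3]],
    sources: &[[f64; 3]],
    charges: &[f64],
) -> Vec<f64> {
    assert_eq!(sources.len(), charges.len());
    let scale = 1.0 / (4.0 * std::f64::consts::PI);
    targets
        .iter()
        .map(|t| {
            let mut phi = 0.0;
            for (s, &q) in sources.iter().zip(charges) {
                let dx = t[0] - s[0];
                let dy = t[1] - s[1];
                let dz = t[2] - s[2];
                let r2 = dx * dx + dy * dy + dz * dz;
                // Assumption: skip the singular term when a target
                // coincides with a source.
                if r2 > 0.0 {
                    phi += q / r2.sqrt();
                }
            }
            scale * phi
        })
        .collect()
}
```

Every target visits every source, which is the quadratic cost that the FMM avoids by approximating far-field interactions.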

What problem does this paper attempt to address?

The problem this paper addresses is how to optimize the key operation in the Fast Multipole Method (FMM), the Multipole-to-Local (M2L) translation, on modern computer architectures where the gap between floating-point throughput (FLOPS) and memory bandwidth continues to widen, so as to reduce data movement and improve computational performance. Specifically, the paper compares the performance of the BLAS-based M2L translation operator and the FFT-based M2L translation operator in the kernel-independent Fast Multipole Method (kiFMM), and explores the feasibility of BLAS-M2L in three-dimensional problems.

### Detailed Explanation

1. **Background Problems**:
   - Trends in modern computer hardware favor algorithm implementations that minimize data movement, even if this requires more floating-point operations.
   - The Fast Multipole Method (FMM) is an algorithm for accelerating the solution of N-body potential problems, reducing the complexity from O(NM) to O(P(N + M)), where P is a small constant that controls precision.
   - In kiFMM, the M2L operation usually depends on the Fast Fourier Transform (FFT), but in recent years BLAS-based M2L translation operators, which rely on direct matrix compression techniques, have also become popular.
2. **Research Objectives**:
   - Evaluate the performance of the BLAS-based M2L translation operator in three-dimensional problems, especially for the Laplace kernel.
   - Compare the performance of BLAS-M2L and FFT-M2L in kiFMM, especially on modern processor architectures.
3. **Main Contributions**:
   - Propose algorithmic improvements to the BLAS-M2L translation operator that increase cache reuse and maximize arithmetic intensity.
   - Conduct extensive benchmarks with a custom high-performance kiFMM implementation, comparing FFT-M2L and BLAS-M2L for the three-dimensional Laplace kernel.
4. **Research Methods**:
   - Develop a high-performance kiFMM implementation in Rust that allows flexible switching between BLAS-based and FFT-based translation operators.
   - Evaluate the two translation operators through single-node benchmarks and analyze the results.

### Conclusion

This paper explores the competitiveness of BLAS-M2L relative to FFT-M2L on modern computer architectures by optimizing the BLAS-M2L translation operator, especially in three-dimensional problems. The results indicate that, with appropriate optimization, BLAS-M2L can achieve performance comparable to, or in some cases better than, FFT-M2L.
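The "direct matrix compression" idea behind BLAS-based M2L operators can be illustrated with a low-rank factorization: if a translation matrix A is well approximated by U Vᵀ with small rank k, it can be applied as U (Vᵀ x) without ever forming A. The sketch below is a minimal illustration under stated assumptions, not the paper's operator: the factors are supplied by hand rather than computed by an SVD, plain loops stand in for BLAS calls, and all function names are hypothetical.

```rust
/// Dense matrix-vector product y = A x, with A stored row-major (rows x cols).
/// Costs rows * cols multiply-adds.
pub fn dense_matvec(a: &[f64], rows: usize, cols: usize, x: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|i| {
            a[i * cols..(i + 1) * cols]
                .iter()
                .zip(x)
                .map(|(aij, xj)| aij * xj)
                .sum::<f64>()
        })
        .collect()
}

/// Apply A = U V^T in factored form: y = U (V^T x).
/// U is rows x k and V is cols x k, both row-major.
/// Costs k * (rows + cols) multiply-adds, which is cheaper than the
/// dense product whenever k is small relative to rows and cols.
pub fn low_rank_matvec(
    u: &[f64],
    v: &[f64],
    rows: usize,
    cols: usize,
    k: usize,
    x: &[f64],
) -> Vec<f64> {
    assert_eq!(v.len(), cols * k);
    assert_eq!(x.len(), cols);
    // t = V^T x has length k.
    let mut t = vec![0.0; k];
    for j in 0..cols {
        for r in 0..k {
            t[r] += v[j * k + r] * x[j];
        }
    }
    // y = U t.
    dense_matvec(u, rows, k, &t)
}

/// Form A = U V^T explicitly. Only used to check the factored product.
pub fn expand_low_rank(u: &[f64], v: &[f64], rows: usize, cols: usize, k: usize) -> Vec<f64> {
    let mut a = vec![0.0; rows * cols];
    for i in 0..rows {
        for j in 0..cols {
            for r in 0..k {
                a[i * cols + j] += u[i * k + r] * v[j * k + r];
            }
        }
    }
    a
}
```

When many vectors x share the same operator, the two factored steps become matrix-matrix products, which is the shape of computation that BLAS level 3 routines handle efficiently.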

M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures
