Abstract: Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multi-body potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.

What problem does this paper attempt to address?

The main problem this paper addresses is the efficient, cross-platform implementation of many-body potentials in molecular dynamics simulations. Specifically, the authors focus on the performance optimization and portability of the Tersoff many-body potential across different hardware architectures. Many-body potentials give more accurate simulation results than the widely studied pair potentials, but their complexity makes them hard for compilers to optimize effectively, so cross-platform performance remains an open problem. By abstracting from the target architecture and the computing precision, the authors develop a vectorization scheme applicable to both CPUs and accelerators, and implement an optimized version of the Tersoff potential in the LAMMPS molecular dynamics code. On a cluster of Intel Xeon Phi nodes, the optimized solver is 3 to 5 times faster than the pure MPI reference implementation.

### Main problem summary

1. **Complexity of many-body potentials**: Many-body potentials (such as the Tersoff potential) are more complex than pair potentials (such as the Lennard-Jones potential) and are difficult for compilers to optimize effectively.
2. **Cross-platform performance optimization**: How to compute many-body potentials efficiently on different hardware architectures while keeping the performance portable.

### Solutions

1. **Vectorization scheme**: By abstracting from the target architecture and the computing precision, the authors develop a vectorization scheme applicable to both CPUs and accelerators.
2. **Optimized implementation**: An optimized version of the Tersoff potential is implemented in the LAMMPS molecular dynamics code, using techniques such as pre-computing derivatives, avoiding masking and divergence, and filtering the neighbor list.
3. **Performance evaluation**: Performance is evaluated on multiple architectures, including ARM, Intel CPUs from Westmere to Broadwell, NVIDIA Tesla GPUs (Kepler generation), and two generations of Intel Xeon Phi (Knights Corner and Knights Landing), with significant improvements on each.

### Specific technical details

1. **Pre-computing derivatives**:
   - ζ and its derivatives are computed once, in the first loop over k, so they are not recalculated in the second loop.
   - This requires extra storage for the derivatives with respect to each k, in exchange for a significant performance gain.
2. **Vectorization options**:
   - **Scheme (1a)**: Map the outer loop I to parallel execution and the inner loop J to data parallelism.
   - **Scheme (1b)**: Fuse the outer loop I and the inner loop J and map the fused loop to data parallelism.
   - **Scheme (1c)**: Map the outer loop I to both parallel execution and data parallelism, and execute the inner loop J sequentially.
3. **Avoiding masking and divergence**:
   - Each vector lane controls its own iteration index, so that as many lanes as possible take part in every evaluation of the numerical kernel and fewer lanes sit idle.
4. **Filtering the neighbor list**:
   - The neighbor list is filtered in a scalar segment of the program to remove unnecessary calculations; the effect is most pronounced on AVX.

### Experimental results

- On the Intel Xeon Phi cluster, the optimized solver is 3 to 5 times faster than the pure MPI reference implementation.
- Across the other architectures, the reported speedups range from 2 to 8 times, depending on the architecture and the benchmark.
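The pre-computation of ζ and its derivatives listed under the technical details can be illustrated with a toy bond-order term. This is a minimal sketch, not the paper's LAMMPS code: the functions `h` and `dh` and the form `b = (1 + zeta) ** -0.5` are assumptions chosen only to keep the example short, standing in for the real Tersoff terms. The first loop over k accumulates ζ and stores each per-k derivative; the second loop reuses the stored values instead of recomputing them.

```python
import math


def h(x):
    """Hypothetical per-k contribution to zeta (stand-in for the real Tersoff term)."""
    return math.exp(-x)


def dh(x):
    """Derivative of the hypothetical contribution h with respect to x."""
    return -math.exp(-x)


def bond_order_and_grad(xs):
    """Return b(zeta) and db/dx_k for every k, visiting each k twice at most."""
    zeta = 0.0
    dzeta = []  # extra storage: one derivative per k
    for x in xs:  # first loop over k: accumulate zeta, store d(zeta)/dx_k
        zeta += h(x)
        dzeta.append(dh(x))

    # The prefactor depends on the complete sum, so it is only known here.
    b = (1.0 + zeta) ** -0.5
    db_dzeta = -0.5 * (1.0 + zeta) ** -1.5

    # Second loop over k: chain rule using the stored derivatives.
    grad = [db_dzeta * d for d in dzeta]
    return b, grad


b, grad = bond_order_and_grad([0.5, 1.0, 1.5])
```

The trade-off is the one named in the summary: the list `dzeta` costs memory proportional to the number of k neighbors, but the expensive per-k derivative is evaluated only once.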
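The idea of letting each vector lane control its own iteration index can be sketched by simulating lanes with plain Python lists. This is an illustration under stated assumptions, not real SIMD code and not the paper's implementation: each lane is given a hypothetical list of flags saying whether its j-th neighbor lies inside the cutoff, and the two functions count how often the numerical kernel would have to be invoked.

```python
def masked_kernel_calls(in_cutoff):
    """Lock-step iteration: all lanes share one j index and the kernel runs under a mask."""
    steps = max(len(lane) for lane in in_cutoff)
    calls = 0
    for j in range(steps):
        mask = [j < len(lane) and lane[j] for lane in in_cutoff]
        if any(mask):  # kernel runs even if only one lane has useful work
            calls += 1
    return calls


def independent_index_kernel_calls(in_cutoff):
    """Each lane advances its own j index until it holds useful work."""
    idx = [0] * len(in_cutoff)
    calls = 0
    while True:
        ready = []
        for lane, flags in enumerate(in_cutoff):
            # Skip entries outside the cutoff for this lane only.
            while idx[lane] < len(flags) and not flags[idx[lane]]:
                idx[lane] += 1
            ready.append(idx[lane] < len(flags))
        if not any(ready):
            return calls
        calls += 1  # one kernel invocation with every ready lane active
        for lane, is_ready in enumerate(ready):
            if is_ready:
                idx[lane] += 1


# Four lanes; True marks a neighbor inside the cutoff (made-up data).
lanes = [
    [True, False, False, True],
    [False, True, False, False],
    [False, False, True, True],
    [False, False, False, True],
]
print(masked_kernel_calls(lanes), independent_index_kernel_calls(lanes))  # prints "4 2"
```

In this example both variants process the same six in-cutoff interactions, but the lock-step version spends four kernel invocations on them and the independent-index version two, which is the reduction in idle lanes that the summary describes.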
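Neighbor-list filtering, the last of the technical details, can be sketched as a scalar pre-pass. The function name `filter_neighbors`, the data layout, and the sample coordinates are assumptions made for illustration; LAMMPS stores its neighbor lists differently. The pass drops every neighbor beyond the cutoff, so the later vectorized loops only visit pairs that actually interact.

```python
def filter_neighbors(positions, neighbors, cutoff):
    """Keep only those neighbors j of each atom i with |r_j - r_i| < cutoff."""
    cutoff_sq = cutoff * cutoff  # compare squared distances, no sqrt needed
    filtered = []
    for i, neigh_i in enumerate(neighbors):
        xi, yi, zi = positions[i]
        kept = []
        for j in neigh_i:
            xj, yj, zj = positions[j]
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            if dx * dx + dy * dy + dz * dz < cutoff_sq:
                kept.append(j)
        filtered.append(kept)
    return filtered


# Made-up coordinates and a full neighbor list, as built with a generous skin.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.5, 0.0), (4.5, 0.0, 0.0)]
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
short_lists = filter_neighbors(positions, neighbors, cutoff=3.0)
print(short_lists)  # prints [[1, 2], [0, 2], [0, 1], []]
```

Shorter, pre-filtered lists mean fewer masked-out iterations downstream, which fits the summary's remark that the benefit depends on the instruction set.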
### Conclusion

Through this combination of optimization techniques, the paper demonstrates an efficient and portable implementation of a many-body potential across different hardware architectures, yielding substantial speedups for molecular dynamics simulations.

The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability

Student Cluster Competition 2017, Team Peking University: Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Broadwell Architecture

Student Cluster Competition 2017, Team Tsinghua University: Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Skylake and NVIDIA Volta Architectures

Efficient molecular dynamics simulations with many-body potentials on graphics processing units

Code modernization strategies for short-range non-bonded molecular dynamics simulations

Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives

Accelerated molecular dynamics force evaluation on graphics processing units for thermal conductivity calculations

A Study of Performance Portability in Plasma Physics Simulations

FPGA-Accelerated Tersoff Multi-body Potential for Molecular Dynamics Simulations

Achieving Performance Portability in Gaussian Basis Set Density Functional Theory on Accelerator Based Architectures in NWChemEx

Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures

MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU System

VecDualSPHysics: a vectorized implementation of Smoothed Particle Hydrodynamics method for simulating fluid flows on multi-core processors

A Lightweight Approach to Performance Portability with targetDP

Porting Molecular Dynamics simulation to heterogeneous multi-core architecture

Evaluating Portable Parallelization Strategies for Heterogeneous Architectures in High Energy Physics

Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels

Optimizing the Performance of Reactive Molecular Dynamics Simulations for Many-Core Architectures

Parallelization of Kinetic Theory Simulations

Student Cluster Competition 2017, Team University of Texas at Austin/Texas State University: Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Skylake and NVIDIA V100 Architectures