The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability

Markus Höhnerbach, Ahmed E. Ismail, Paolo Bientinesi
DOI: https://doi.org/10.48550/arXiv.1607.02904
2016-07-11
Abstract: Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.
Computational Engineering, Finance, and Science; Distributed, Parallel, and Cluster Computing; Mathematical Software; Performance
What problem does this paper attempt to address?
The main problem this paper addresses is the efficient, cross-platform implementation of many-body potentials in molecular dynamics simulations. Specifically, the authors focus on the performance optimization and portability of the Tersoff many-body potential across hardware architectures. Compared with the widely studied pair potentials, many-body potentials give more accurate simulation results, but their complexity makes them hard for compilers to optimize effectively, so achieving cross-platform performance remains an open problem. By abstracting from the target architecture and the computing precision, the authors develop a vectorization scheme applicable to both CPUs and accelerators and implement an optimized Tersoff potential in the molecular dynamics code LAMMPS. On an Intel Xeon Phi cluster, the optimized solver is 3 to 5 times faster than the pure MPI reference implementation.

### Main problem summary

1. **Complexity of many-body potentials**: Many-body potentials (such as the Tersoff potential) are more complex than pair potentials (such as the Lennard-Jones potential) and are hard for compilers to optimize effectively.
2. **Cross-platform performance optimization**: How to compute many-body potentials efficiently on different hardware architectures while keeping the performance portable.

### Solutions

1. **Vectorization scheme**: By abstracting from the target architecture and the computing precision, the authors develop a vectorization scheme applicable to both CPUs and accelerators.
2. **Optimized implementation**: An optimized Tersoff potential is implemented in the molecular dynamics code LAMMPS, using techniques such as pre-computing derivatives, avoiding masking and divergence, and filtering the neighbor list.
3. **Performance evaluation**: Performance is evaluated on multiple architectures: ARM, Intel x86 CPUs from Westmere to Broadwell, Nvidia Tesla GPUs (Kepler generation), and two generations of Intel Xeon Phi (Knights Corner and Knights Landing). All show significant performance improvements.

### Specific technical details

1. **Pre-computing derivatives**:
   - ζ and its derivatives are computed once, in the first loop, which avoids recomputing them later.
   - This requires extra storage for the derivatives for each k, but can improve performance significantly.
2. **Vectorization options**:
   - **Scheme (1a)**: Map the outer loop I to parallel execution and the inner loop J to data parallelism.
   - **Scheme (1b)**: Fuse the outer loop I and the inner loop J and map the fused loop to data parallelism.
   - **Scheme (1c)**: Map the outer loop I to both parallel execution and data parallelism, and execute the inner loop J sequentially.
3. **Avoiding masking and divergence**:
   - The iteration index of each vector lane is controlled independently, so that as many lanes as possible take part in the numerical kernel and fewer resources are wasted.
4. **Filtering the neighbor list**:
   - The neighbor list is filtered in a scalar part of the program to avoid unnecessary calculations; the effect is most pronounced on AVX.

### Experimental results

- On an Intel Xeon Phi cluster, the optimized solver is 3 to 5 times faster than the pure MPI reference implementation.
- Across the other architectures, the reported speedups range from 2 to 8 times, depending on the architecture and the benchmark.
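The neighbor-list filtering listed under the technical details can be illustrated with a minimal Python sketch. It is not the LAMMPS implementation: the function names `filter_neighbors` and `pack_into_lanes`, the vector width of 4, and the sample coordinates are all hypothetical. It also assumes that the neighbor list contains some atoms beyond the potential's cutoff (for example because the list was built with a larger radius). Filtering in a scalar pass first means fewer vector lanes are spent on atoms that contribute nothing.

```python
def filter_neighbors(positions, neighbors, i, cutoff):
    """Scalar pass: keep only the neighbors of atom i closer than `cutoff`."""
    cutoff_sq = cutoff * cutoff
    xi, yi, zi = positions[i]
    kept = []
    for j in neighbors:
        xj, yj, zj = positions[j]
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        # Compare squared distances to avoid a square root per neighbor.
        if dx * dx + dy * dy + dz * dz < cutoff_sq:
            kept.append(j)
    return kept


def pack_into_lanes(indices, width, pad=-1):
    """Group indices into fixed-width batches, padding the last batch.

    Each batch stands in for one vector iteration; `pad` marks idle lanes.
    """
    batches = []
    for start in range(0, len(indices), width):
        chunk = list(indices[start:start + width])
        chunk += [pad] * (width - len(chunk))
        batches.append(chunk)
    return batches


# Hypothetical data: atom 0 with five listed neighbors, two of them beyond
# the cutoff of 3.0 (atoms 3 and 4).
positions = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (0.0, 0.0, 3.5),
    (2.5, 2.5, 0.0),
    (0.0, -1.5, 0.0),
]
raw_neighbors = [1, 2, 3, 4, 5]
WIDTH = 4  # assumed vector width, e.g. 4 double-precision lanes

unfiltered_batches = pack_into_lanes(raw_neighbors, WIDTH)
filtered = filter_neighbors(positions, raw_neighbors, 0, cutoff=3.0)
filtered_batches = pack_into_lanes(filtered, WIDTH)

print("vector iterations without filtering:", len(unfiltered_batches))
print("vector iterations with filtering:   ", len(filtered_batches))
```

In this toy case the unfiltered list needs two vector iterations, one of them mostly idle, while the filtered list fits in a single iteration. The actual benefit on real hardware depends on the neighbor-list density and the vector width.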
### Conclusion

Through a series of optimization techniques, the paper achieves an efficient, cross-platform implementation of a many-body potential on a range of hardware architectures, which gives molecular dynamics simulations a substantial performance improvement.