Robust Training of Machine Learning Interatomic Potentials with Dimensionality Reduction and Stratified Sampling

Ji Qi, Tsz Wai Ko, Brandon C. Wood, Tuan Anh Pham, Shyue Ping Ong
2023-07-25
Abstract: Machine learning interatomic potentials (MLIPs) enable the accurate simulation of materials at larger sizes and time scales, and play increasingly important roles in the computational understanding and design of materials. However, MLIPs are only as accurate and robust as the data they are trained on. In this work, we present DImensionality-Reduced Encoded Clusters with sTratified (DIRECT) sampling as an approach to select a robust training set of structures from a large and complex configuration space. By applying DIRECT sampling on the Materials Project relaxation trajectories dataset with over one million structures and 89 elements, we develop an improved materials 3-body graph network (M3GNet) universal potential that extrapolates more reliably to unseen structures. We further show that molecular dynamics (MD) simulations with universal potentials such as M3GNet can be used in place of expensive ab initio MD to rapidly create a large configuration space for target materials systems. Combined with DIRECT sampling, we develop a highly reliable moment tensor potential for the Ti-H system without the need for iterative optimization. This work paves the way towards robust high-throughput development of MLIPs across any compositional complexity.
Materials Science
What problem does this paper attempt to address?
The paper addresses the problem of building training datasets for machine learning interatomic potentials (MLIPs). The authors propose DImensionality-Reduced Encoded Clusters with sTratified (DIRECT) sampling, a method for selecting a robust training set of structures from a large and complex configuration space, with the goal of making MLIPs more reliable and accurate on unseen structures. The main contributions of the paper are:

1. **Proposing the DIRECT sampling method**: Structures are encoded as feature vectors, the features are reduced in dimensionality, the reduced features are clustered, and structures are then drawn from each cluster (stratified sampling). This selects a training set efficiently from a large-scale structure database while keeping broad coverage of the configuration space.
2. **Improving the M3GNet universal potential**: By applying DIRECT sampling to the Materials Project relaxation trajectories dataset, which contains over one million structures and 89 elements, the authors develop an improved M3GNet universal potential that extrapolates more reliably to unseen structures and performs better in predicting material structures and dynamic properties.
3. **Application to the titanium hydride system**: MD simulations with the M3GNet universal potential replace expensive ab initio MD for generating a large configuration space. Combined with DIRECT sampling, this yields a reliable moment tensor potential (MTP) for studying hydrogen diffusion in the titanium hydride (Ti-H) system, without the need for iterative optimization.

Together, these results pave the way for efficient, high-throughput development of MLIPs for any chemical composition. The benefit is greatest for highly complex configuration spaces, where the approach significantly reduces the number of active learning iterations and the computational cost.
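The sampling workflow summarized above (encode, reduce dimensionality, cluster, sample from each cluster) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the feature matrix is assumed to be precomputed (one row per structure), PCA stands in for the dimensionality reduction step, a plain k-means stands in for whatever clustering algorithm the paper uses, and picking the members nearest each cluster centroid is just one possible stratified selection rule. All function names (`pca_reduce`, `kmeans`, `stratified_sample`, `direct_like_sample`) are hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Standardize features and project them onto the leading principal axes."""
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    z = (x - x.mean(axis=0)) / std
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T


def _assign(points, centers):
    """Return the index of the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumption: a stand-in clustering step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = _assign(points, centers)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members):  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return _assign(points, centers), centers


def stratified_sample(points, labels, centers, per_cluster=1):
    """Pick up to `per_cluster` members closest to each cluster centroid."""
    selected = []
    for k in range(len(centers)):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        dist = np.linalg.norm(points[idx] - centers[k], axis=1)
        selected.extend(idx[np.argsort(dist)[:per_cluster]].tolist())
    return sorted(selected)


def direct_like_sample(features, n_components=2, n_clusters=5,
                       per_cluster=1, seed=0):
    """Reduce, cluster, then sample; returns row indices of chosen structures."""
    reduced = pca_reduce(features, n_components)
    labels, centers = kmeans(reduced, n_clusters, seed=seed)
    return stratified_sample(reduced, labels, centers, per_cluster)


if __name__ == "__main__":
    # Synthetic stand-in for encoded structure features (200 structures, 8 dims).
    rng = np.random.default_rng(42)
    fake_features = rng.normal(size=(200, 8))
    picked = direct_like_sample(fake_features, n_components=3, n_clusters=10)
    print(len(picked), "structures selected")
```

The returned indices identify which structures would go on to reference calculations for training. Raising `n_clusters` or `per_cluster` trades a larger training set for denser coverage of the reduced feature space.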