Data Generation for Machine Learning Interatomic Potentials and Beyond

Maksim Kulichenko, Benjamin Nebgen, Nicholas Lubbers, Justin S Smith, Kipton Barros, Alice E A Allen, Adela Habib, Emily Shinkle, Nikita Fedik, Ying Wai Li, Richard A Messerly, Sergei Tretiak
DOI: https://doi.org/10.1021/acs.chemrev.4c00572
2024-11-21
Abstract: The field of data-driven chemistry is undergoing an evolution, driven by innovations in machine learning (ML) models for predicting molecular properties and behavior. Recent strides in ML-based interatomic potentials (MLIPs) have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant defining MLIP reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review navigates the intricate landscape of essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We delve into the details of active learning, discussing its various facets and implementations. We outline different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. The role of atomistic data samplers in generating diverse and informative structures is highlighted. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an innovative approach to diversify training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.