Enhanced sampling of robust molecular datasets with uncertainty-based collective variables

Aik Rui Tan, Johannes C. B. Dietschreit, Rafael Gomez-Bombarelli
2024-02-06
Abstract: Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine learned interatomic potentials (MLIP). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may not capture rare, but highly informative configurations. In this study, we propose a method that leverages uncertainty as the collective variable (CV) to guide the acquisition of chemically-relevant data points, focusing on regions of the configuration space where ML model predictions are most uncertain. This approach employs a Gaussian Mixture Model-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. The effectiveness of our approach in overcoming energy barriers and exploring unseen energy minima, thereby enhancing the data set in an active learning framework, is demonstrated on the alanine dipeptide benchmark system.
Machine Learning, Computational Physics
What problem does this paper attempt to address?
This paper addresses the problem of generating datasets that are representative of the accessible configuration space of a molecular system, which is crucial for the robustness of machine learned interatomic potentials (MLIPs). Molecular systems have intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, so traditional data generation methods such as random sampling or exhaustive exploration are either intractable or may miss rare but informative configurations. The paper proposes using uncertainty as the collective variable (CV) to guide data acquisition, focusing on regions of configuration space where the ML model's predictions are most uncertain. Specifically, a Gaussian mixture model-based uncertainty metric obtained from a single model serves as the CV for biased molecular dynamics simulations. On the alanine dipeptide benchmark system, the method is shown to overcome energy barriers and explore previously unseen energy minima, thereby enhancing the dataset within an active learning framework. In summary, the main contributions of the paper are:
1. Proposing the use of single-model uncertainty as a collective variable for enhanced sampling, in order to create a diverse dataset that covers the configuration space.
2. Demonstrating the effectiveness of the method on the flexible alanine dipeptide molecule, starting from minimal initial training data in an active learning setting.
3. Combining MLIP uncertainty estimates, used as CVs, with existing enhanced sampling methods to steer the system toward critical regions that are undersampled in the current training data, thus improving the robustness of the MLIP.
The paper contrasts uncertainty-guided data acquisition with other data collection strategies, including traditional molecular dynamics and quantum mechanical calculations, random sampling, and existing uncertainty-guided active learning methods, and highlights its advantages for exploring the configuration space of molecular systems. It also proposes using single-model uncertainty instead of ensemble uncertainty to reduce training and exploration costs.
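To make the idea of a Gaussian mixture model-based uncertainty more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the uncertainty is taken as the negative log-likelihood of a feature vector under a diagonal-covariance mixture fitted to the training data; the function name `gmm_nll`, the two-dimensional features, and all mixture parameters are hypothetical values chosen for illustration. Configurations far from the training data receive a larger value, which is the property that lets such a quantity act as a CV that a bias can push upward.

```python
import math


def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture (hypothetical helper, used here as an uncertainty score)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log(weight) + log N(x | mu, diag(var)) for this component
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    log_likelihood = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return -log_likelihood


# Assumed mixture parameters standing in for a model fitted to training features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

# A point close to a mixture component versus one far from all of them.
near_training = gmm_nll([0.1, -0.2], weights, means, variances)
far_from_training = gmm_nll([10.0, -8.0], weights, means, variances)

print(f"uncertainty near training data:     {near_training:.3f}")
print(f"uncertainty far from training data: {far_from_training:.3f}")
```

In a biased simulation the score would be evaluated on the model's features for the current configuration at each step; how the features are extracted and how the bias is applied are outside the scope of this sketch.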