Physics-based Machine Learning to Predict Hydration Free Energies for Small Molecules with a minimal number of descriptors: Interpretable and Accurate

Ajeet Kumar Yadav, Marvin V. Prakash, Pradipta Bandyopadhyay
DOI: https://doi.org/10.26434/chemrxiv-2024-v8h0j
2024-10-25
Abstract: Hydration free energy (HFE) of molecules is a fundamental property having importance throughout chemistry and biology. Calculation of the HFE can be challenging and expensive with classical molecular dynamics simulation-based approaches. Machine learning (ML) models are increasingly being used to predict HFE. Although the accuracy of ML models for datasets for small molecules is impressive, these models suffer from lack of interpretability. In this work, we have developed a physics-based ML model with only six descriptors, which is both accurate and fully interpretable, and applied it to a database for small molecule HFE, FreeSolv. We have evaluated the electrostatic energy by an approximate closed form of the Generalized Born (GB) model and polar surface area. In addition, we have logP and hydrogen bond acceptor and donors as descriptors along with the number of rotatable bonds. We have used different ML models such as random forest and extreme gradient boosting. The best result from these models has a mean absolute error of only 0.74 kcal/mol. The main power of this model is that the descriptors have clear physical meaning and it was found that the descriptor describing the electrostatics and the polar surface area, followed by the hydrogen bond donors and acceptors, are the most important factors for the calculation of hydration free energy.
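The abstract refers to an approximate closed form of the Generalized Born (GB) model without stating the expression. As a rough illustration only, the sketch below evaluates the widely used Still-type GB polarization energy, which may differ from the approximation the authors actually use. The function name `gb_polarization_energy`, the dielectric constants, and the effective Born radii passed in are assumptions made for this example.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2), a common force-field convention.
COULOMB_KCAL = 332.0637


def gb_polarization_energy(charges, coords, born_radii, eps_in=1.0, eps_out=78.5):
    """Still-type GB polarization energy in kcal/mol (illustrative only).

    charges    -- partial charges in units of e
    coords     -- (x, y, z) positions in Angstrom
    born_radii -- effective Born radii in Angstrom (assumed to be given)
    """
    prefactor = -0.5 * COULOMB_KCAL * (1.0 / eps_in - 1.0 / eps_out)
    total = 0.0
    n = len(charges)
    # The double sum runs over all pairs, including i == j (the self terms).
    for i in range(n):
        for j in range(n):
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            rr = born_radii[i] * born_radii[j]
            # f_GB interpolates between the Born radius (r -> 0)
            # and the plain interatomic distance (large r).
            f_gb = math.sqrt(r2 + rr * math.exp(-r2 / (4.0 * rr)))
            total += charges[i] * charges[j] / f_gb
    return prefactor * total


# Toy two-site dipole with made-up charges and radii (not taken from FreeSolv).
energy = gb_polarization_energy(
    charges=[0.5, -0.5],
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    born_radii=[1.5, 1.5],
)
print(f"GB polarization energy: {energy:.3f} kcal/mol")
```

For a single charged site the double sum reduces to the Born formula for an ion, which is a convenient sanity check on any implementation of this kind.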
Chemistry
What problem does this paper attempt to address?
The paper addresses the following problem: **How can the hydration free energy (HFE) of small molecules be predicted accurately by machine-learning models that use only a small number of physically interpretable descriptors?**

### Specific problem background:

1. **Importance of hydration free energy**:
   - Hydration free energy (HFE) is a fundamental property in chemistry and biology and is crucial for understanding the behavior of molecules in solvents.
   - Traditional methods that calculate HFE from classical molecular dynamics simulations are both complex and expensive.
2. **Limitations of existing methods**:
   - Although machine-learning (ML) models perform well in predicting HFE, they often lack interpretability, which makes it difficult to understand how they work or why they make errors.

### Goals of the paper:

- Develop a physics-based machine-learning model that predicts the HFE of small molecules using as few descriptors as possible, so that the model is both accurate and fully interpretable.
- Validate the model on the FreeSolv database, which contains 643 small organic molecules and their experimentally measured HFE values.

### Main contributions:

- **Descriptor selection**: Only six descriptors with clear physical meanings are used: polar surface area, number of hydrogen-bond donors, number of hydrogen-bond acceptors, logP, number of rotatable bonds, and a charge term (the sum of the GB term and the Coulomb electrostatic term).
- **Model performance**: Several machine-learning models (Random Forest, XGBoost, Gradient Boosting, and LightGBM) are trained, and the best result has a mean absolute error (MAE) of only 0.74 kcal/mol.
- **Interpretability**: Because the descriptors have clear physical meanings, the model's predictions can be explained directly. The charge term and the polar surface area have the largest effect on HFE.
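To make the six-descriptor representation and the reported error metric concrete, here is a minimal Python sketch. The class name `HfeDescriptors`, its field names, and all numeric values are hypothetical and chosen only for illustration. They are not taken from the paper or from FreeSolv.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class HfeDescriptors:
    """One molecule described by the six descriptors named in the summary."""
    charge_term: float         # GB term + Coulomb electrostatic term (kcal/mol)
    polar_surface_area: float  # Angstrom^2
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float
    rotatable_bonds: int

    def as_vector(self):
        """Return the descriptors as a flat list of floats for an ML model."""
        return [float(x) for x in astuple(self)]


def mean_absolute_error(y_true, y_pred):
    """MAE in the units of the inputs (kcal/mol for HFE)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values, for illustration only.
mol = HfeDescriptors(
    charge_term=-12.3,
    polar_surface_area=20.2,
    h_bond_donors=1,
    h_bond_acceptors=1,
    log_p=0.8,
    rotatable_bonds=2,
)
print(mol.as_vector())

experimental = [-1.0, -2.0, -3.0]  # hypothetical HFE values (kcal/mol)
predicted = [-1.5, -2.0, -2.0]     # hypothetical model output
print(mean_absolute_error(experimental, predicted))
```

A vector of this shape is what a tree-based regressor such as a random forest would take as input for each molecule, and the MAE between predicted and experimental HFE values is the quantity the paper reports as 0.74 kcal/mol for its best model.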
### Summary:

By introducing a physics-based machine-learning model, the paper addresses both the complexity and high cost of traditional simulation-based methods and the lack of interpretability of existing machine-learning models. The result is an efficient, easy-to-understand HFE prediction tool for fields such as drug design.