Leveraging High-throughput Molecular Simulations and Machine Learning for Formulation Design
Alex K. Chew, Mohammad Atif Faiz Afzal, Zach Kaplan, Eric M. Collins, Suraj Gattani, Mayank Misra, Anand Chandrasekaran, Karl Leswing, Mathew D. Halls
DOI: https://doi.org/10.26434/chemrxiv-2024-4lff6
2024-06-18
Abstract: Formulations, or mixtures of chemical ingredients, are ubiquitous across materials science applications, such as thermoplastics, consumer packaged goods, and energy storage devices. However, finding formulations with optimal properties is difficult because of the non-obvious connection between the structures and compositions of the individual ingredients and the resulting mixture properties. Computational approaches that can traverse the expansive design space offer a promising way to find formulations with improved properties while minimizing the number of experiments. In this work, we generated a large formulation dataset using high-throughput classical molecular dynamics simulations, yielding more than 30,000 solvent mixtures ranging from pure-component to five-component systems. We developed three formulation-property relationship approaches for building machine learning models that take ingredient structure and composition as input and predict a formulation property: formulation descriptor aggregation (FDA), formulation descriptor Set2Set (FDS2S), and formulation graph (FG). We found that FDS2S, a new approach that uses a Set2Set layer to aggregate the molecular descriptors of individual ingredients, outperforms the other approaches in predicting density, heat of vaporization, and enthalpy of mixing computed from molecular simulations. Feature importance analysis of FDA models reveals that specific substructures are important for predicting these formulation properties, which is useful when designing formulations to achieve target properties. When leveraging an active learning framework to iteratively suggest the next ingredient and composition to test, we found that formulation-property relationships can identify the formulations with the highest property values at least two to three times faster than random guessing. 
The results demonstrate that formulation-property relationships provide valuable insight to suggest the next experiment even when starting from a limited dataset of ~100 examples. Our research demonstrates the utility of high-throughput simulations and machine learning algorithms applied to designing formulations with promising properties, which could broadly accelerate the design of new materials for a wide range of applications, such as improving the performance of liquid electrolytes for batteries, fuel mixtures for oil and gas, solvent additives for perfumes or paints, and more.
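As an illustration of the descriptor aggregation idea behind FDA, the sketch below combines per-ingredient descriptor vectors into a single formulation-level vector using a composition-weighted mean. This is only one plausible reading of "aggregation" and is not taken from the paper; the function name `aggregate_descriptors`, the choice of a weighted mean, and the descriptor values are all hypothetical.

```python
from typing import List, Sequence


def aggregate_descriptors(
    descriptors: Sequence[Sequence[float]],
    fractions: Sequence[float],
) -> List[float]:
    """Return the composition-weighted mean of per-ingredient descriptors.

    Assumption: a weighted mean is used here purely for illustration;
    the aggregation used by the authors may differ.
    """
    if not descriptors:
        raise ValueError("at least one ingredient is required")
    if len(descriptors) != len(fractions):
        raise ValueError("one fraction is required per ingredient")

    n_features = len(descriptors[0])
    if any(len(d) != n_features for d in descriptors):
        raise ValueError("all descriptor vectors must have the same length")

    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")

    # Normalize so the weights sum to one even if the input does not.
    weights = [f / total for f in fractions]

    # Weighted sum over ingredients, feature by feature.
    return [
        sum(w * d[j] for w, d in zip(weights, descriptors))
        for j in range(n_features)
    ]


# Hypothetical two-feature descriptors for two ingredients.
ingredient_a = [18.0, 1.0]
ingredient_b = [46.0, 2.0]

# A 75:25 mixture of the two ingredients.
mixture = aggregate_descriptors([ingredient_a, ingredient_b], [0.75, 0.25])
print(mixture)
```

The resulting fixed-length vector could then be passed to any standard regression model. Unlike a learned aggregation such as a Set2Set layer, a weighted mean has no trainable parameters, which makes it simpler to interpret but potentially less expressive.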
Chemistry