Bayesian Optimization of Catalysts With In-context Learning

Mayk Caldas Ramos, Shane S. Michtavy, Marc D. Porosoff, Andrew D. White
2023-04-12
Abstract: Large language models (LLMs) are able to do accurate classification with zero or only a few examples (in-context learning). We show a prompting system that enables regression with uncertainty for in-context learning with frozen LLM (GPT-3, GPT-3.5, and GPT-4) models, allowing predictions without features or architecture tuning. By incorporating uncertainty, our approach enables Bayesian optimization for catalyst or molecule optimization using natural language, eliminating the need for training or simulation. Here, we performed the optimization using the synthesis procedure of catalysts to predict properties. Working with natural language mitigates the difficulty of synthesizability, since the literal synthesis procedure is the model's input. We showed that in-context learning could improve past a model's context window (the maximum number of tokens the model can process at once) as data is gathered via example selection, allowing the model to scale better. Although our method does not outperform all baselines, it requires no training or feature selection and only minimal computing, while maintaining satisfactory performance. We also find that Gaussian Process Regression on text embeddings is strong at Bayesian optimization. The code is available in our GitHub repository: https://github.com/ur-whitelab/BO-LIFT
Chemical Physics, Machine Learning
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Uncertainty Prediction in Catalyst Optimization**: The paper demonstrates how to use large language models (LLMs) for regression with uncertainty estimates, taking natural language as input, and how to use those estimates to optimize catalysts or molecules. Because the models are frozen, no feature selection or architecture tuning is needed.
2. **Scaling Data in In-Context Learning (ICL)**: The study finds that selecting which examples go into the prompt lets performance keep improving as data is gathered, so the model can draw on thousands of examples without losing accuracy. This addresses the context window limit, which otherwise caps how many examples a prompt can hold.
3. **Application of Bayesian Optimization**: Combining the uncertainty produced by the LLM with Bayesian optimization makes it possible to optimize catalyst synthesis conditions without any training. This reduces the number of experiments needed and improves optimization efficiency.

Specifically, the authors used GPT-series models (GPT-3, GPT-3.5, and GPT-4) and predicted catalyst properties from synthesis procedures written in natural language. This avoids complex feature engineering: the natural language description is the model's input directly, which makes the approach easier to understand and apply. The paper also evaluates different models on predicting drug solubility and yields for the oxidative coupling of methane, comparing them with other baseline models.
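
The Bayesian optimization loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses the Gaussian Process Regression on text embeddings that the abstract mentions as a strong baseline rather than an LLM, and everything in it is an assumption made for the example. In particular, `embed` is a toy hashed bag-of-words stand-in for a real embedding model, `toy_oracle` is a synthetic function standing in for a lab measurement, the procedure strings in `make_pool` are invented, and expected improvement is one common acquisition function chosen here for simplicity.

```python
import math
import re
import zlib

import numpy as np


def embed(text, dim=64):
    """Toy stand-in for a text-embedding model (assumption): hashed bag of words."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-3):
    """Posterior mean and standard deviation of a GP with standardized targets."""
    y_mean = y_train.mean()
    y_std = y_train.std() or 1.0  # fall back to 1 if all targets are equal
    y_norm = (y_train - y_mean) / y_std
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf_kernel(x_train, x_test)
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_norm))
    v = np.linalg.solve(chol, k_s)
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 1e-12)  # prior variance is 1
    return y_mean + y_std * (k_s.T @ alpha), y_std * np.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (maximization)."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def bayes_opt(pool, oracle, n_init=2, n_iter=5, seed=0):
    """Pick procedures from a fixed pool; return the (text, value) history."""
    rng = np.random.default_rng(seed)
    x_pool = np.stack([embed(text) for text in pool])
    observed = [int(i) for i in rng.choice(len(pool), size=n_init, replace=False)]
    for _ in range(n_iter):
        remaining = [i for i in range(len(pool)) if i not in observed]
        if not remaining:
            break
        y = np.array([oracle(pool[i]) for i in observed])
        mean, std = gp_posterior(x_pool[observed], y, x_pool[remaining])
        ei = expected_improvement(mean, std, y.max())
        observed.append(remaining[int(np.argmax(ei))])
    return [(pool[i], oracle(pool[i])) for i in observed]


def make_pool():
    """Invented synthesis procedures, written as plain sentences."""
    return [
        f"Calcine the catalyst at {t} C with {m} wt% metal loading."
        for t in (600, 700, 800, 900, 1000)
        for m in (2, 5, 10)
    ]


def toy_oracle(text):
    """Synthetic 'measured yield' (assumption), peaked at 800 C and 5 wt%."""
    t, m = (float(n) for n in re.findall(r"\d+", text))
    return 30.0 - ((t - 800.0) / 100.0) ** 2 - (m - 5.0) ** 2 / 10.0


if __name__ == "__main__":
    for text, value in bayes_opt(make_pool(), toy_oracle):
        print(f"{value:6.2f}  {text}")
```

In the paper's setting, the surrogate would be the prompted LLM returning a prediction with uncertainty; any model that yields a mean and a standard deviation per candidate text could replace `gp_posterior` in this loop, with the acquisition step unchanged.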