Leveraging large language models for predictive chemistry

Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, Berend Smit
DOI: https://doi.org/10.1038/s42256-023-00788-1
IF: 23.8
2024-02-07
Nature Machine Intelligence
Abstract: Machine learning has transformed many fields and has recently found applications in chemistry and materials science. The small datasets commonly found in chemistry sparked the development of sophisticated machine learning approaches that incorporate chemical knowledge for each application and, therefore, require specialized expertise to develop. Here we show that GPT-3, a large language model trained on vast amounts of text extracted from the Internet, can easily be adapted to solve various tasks in chemistry and materials science by fine-tuning it to answer chemical questions in natural language with the correct answer. We compared this approach with dedicated machine learning models for many applications spanning the properties of molecules and materials to the yield of chemical reactions. Surprisingly, our fine-tuned version of GPT-3 can perform comparably to or even outperform conventional machine learning techniques, in particular in the low-data limit. In addition, we can perform inverse design by simply inverting the questions. The ease of use and high performance, especially for small datasets, can impact the fundamental approach to using machine learning in the chemical and material sciences. In addition to a literature search, querying a pre-trained large language model might become a routine way to bootstrap a project by leveraging the collective knowledge encoded in these foundation models, or to provide a baseline for predictive tasks.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
This paper examines how large language models such as GPT-3 can be used for predictive tasks in chemistry and materials science. The authors show that fine-tuning GPT-3 to answer chemistry questions posed in natural language adapts it to a range of tasks, from predicting the properties of molecules and materials to predicting the yields of chemical reactions. They compare the fine-tuned model with dedicated machine learning models across many applications and find that it performs comparably to, and in some cases better than, those models, particularly when little training data is available. They also perform inverse design by inverting the questions, so that the model is asked for a structure given a target property rather than for the property of a given structure. Because the approach is easy to use and performs well on small datasets, the authors suggest it could change how machine learning is applied in chemistry and materials science. Querying a pre-trained large language model might become a routine way to bootstrap a project, alongside a literature search, by leveraging the collective knowledge encoded in these foundation models, or to provide a baseline for predictive tasks.
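The idea of posing chemistry tasks as natural-language questions, and of inverting those questions for inverse design, can be illustrated with a minimal sketch of how fine-tuning records might be assembled. Everything specific here is an assumption made for illustration and is not taken from the paper: the question wording, the `###` and `@@@` separators, the helper names `forward_record`, `inverse_record`, and `to_jsonl`, and the placeholder rows. The sketch only builds prompt/completion pairs and serializes them as JSON Lines; it does not call any fine-tuning service.

```python
import json

# Assumed markers, not confirmed by the abstract: one ends the question,
# the other ends the answer so a model would know where to stop.
PROMPT_END = "###"
COMPLETION_END = "@@@"


def forward_record(representation, prop_name, value):
    """Forward task: the question names a structure, the answer is the property."""
    return {
        "prompt": f"What is the {prop_name} of {representation}?{PROMPT_END}",
        "completion": f" {value}{COMPLETION_END}",
    }


def inverse_record(representation, prop_name, value):
    """Inverted question: the question names a property value, the answer is a structure."""
    return {
        "prompt": f"What is a molecule with {prop_name} {value}?{PROMPT_END}",
        "completion": f" {representation}{COMPLETION_END}",
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Placeholder rows invented for illustration only (not data from the paper).
rows = [
    ("CCO", "solubility class", "high"),
    ("c1ccccc1", "solubility class", "low"),
]

forward = [forward_record(*row) for row in rows]
inverse = [inverse_record(*row) for row in rows]

print(to_jsonl(forward))
print(to_jsonl(inverse))
```

The same rows feed both directions: swapping which field goes into the question and which into the answer is all that distinguishes the predictive task from the inverse-design task in this sketch.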