Large Language Models for Inorganic Synthesis Predictions

Joshua Schrier, Seongmin Kim, Yousung Jung
DOI: https://doi.org/10.26434/chemrxiv-2024-9bmfj-v2
2024-04-29
Abstract: We evaluate the effectiveness of pre-trained and fine-tuned large language models (LLMs) for predicting the synthesizability of inorganic compounds and the selection of precursors needed to perform inorganic synthesis. The predictions of fine-tuned LLMs are comparable to, and sometimes better than, recent bespoke machine learning models for these tasks, but require only minimal user expertise, cost, and time to develop. Therefore, this strategy can serve both as an effective and strong baseline for future machine learning studies of various chemical applications and as a practical tool for experimental chemists.
Chemistry
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) can predict the synthesizability of inorganic compounds and select the precursors needed to synthesize them. Specifically, the researchers evaluated pre-trained and fine-tuned LLMs on two tasks:

1. **Synthesizability Prediction**: Given a chemical formula, predict whether the compound can be synthesized. This is a positive-unlabeled (PU) learning problem: the available data contain known (previously synthesized) compounds and hypothetical compounds, and the hypothetical ones may or may not be synthesizable. The researchers used data from the Materials Project and the Open Quantum Materials Database to define the set of possible compositions, which contains 393,053 unique inorganic compositions. Of these, 40,817 have references in the Inorganic Crystal Structure Database (ICSD) and are treated as positive samples (synthesized); the remaining 352,236 are treated as unlabeled samples (hypothetical).
2. **Precursor Selection**: Given the chemical formula of a target compound, predict all the precursors required to synthesize it. The output is restricted to a predefined precursor list, which makes this a multi-label prediction problem, and it must exactly match the set of precursors in the known synthesis example. The researchers started from the text-mined synthesis dataset of Kononova et al., removed inconsistent or incomplete data, and retained only reactions whose precursors are each used in ≥5 example reactions, finally obtaining 11,923 unique reactions and 311 precursors.

The researchers used GPT-3.5 and GPT-4 as the base models and fine-tuned them to improve performance on these two tasks. The results show that the fine-tuned LLMs perform comparably to, and sometimes better than, recently developed machine learning models built specifically for these tasks.
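The positive-unlabeled labeling described for the first task can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `label_compositions`, the toy formulas, and the use of a plain set to stand in for "has an ICSD reference" are all assumptions made for illustration. Only the three counts at the end come from the summary above.

```python
def label_compositions(compositions, icsd_referenced):
    """Split compositions into positives (with an ICSD reference) and unlabeled.

    Hypothetical helper: in PU learning there are no confirmed negatives,
    so everything without a reference is 'unlabeled', not 'unsynthesizable'.
    """
    positives = [c for c in compositions if c in icsd_referenced]
    unlabeled = [c for c in compositions if c not in icsd_referenced]
    return positives, unlabeled


# Toy data for illustration only; "Na3Cl5" and "Ba2TiO7" are made-up
# placeholder formulas standing in for hypothetical compositions.
compositions = ["NaCl", "BaTiO3", "Na3Cl5", "Ba2TiO7"]
icsd_referenced = {"NaCl", "BaTiO3"}
positives, unlabeled = label_compositions(compositions, icsd_referenced)

# Counts reported in the summary: total compositions and ICSD-referenced ones.
total_compositions = 393_053
n_positive = 40_817
n_unlabeled = total_compositions - n_positive   # matches the reported 352,236
positive_fraction = n_positive / total_compositions  # roughly one in ten
```

The arithmetic shows how imbalanced the set is: only about a tenth of the compositions carry a positive label, and the rest cannot be assumed to be negatives.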
In addition, the approach is simple and low-cost, so it can serve as a strong baseline for future machine learning research and as a practical tool for experimental chemists.
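The precursor-selection setup described above (keeping only reactions whose precursors are used often enough, then requiring an exact set match) can be sketched as follows. This is a sketch under stated assumptions, not the authors' pipeline: the names `filter_reactions` and `exact_match` are hypothetical, the filter counts precursor usage once over the unfiltered data (the summary does not say whether counting was repeated after filtering), and the toy reactions use a threshold of 2 rather than the reported 5 to keep the example small.

```python
from collections import Counter


def filter_reactions(reactions, min_count=5):
    """Keep reactions whose every precursor appears in >= min_count reactions.

    `reactions` is a list of (target, precursors) pairs. Usage is counted
    once over the unfiltered list (an assumption; single pass only).
    """
    counts = Counter(p for _, precursors in reactions for p in set(precursors))
    return [
        (target, precursors)
        for target, precursors in reactions
        if all(counts[p] >= min_count for p in precursors)
    ]


def exact_match(predicted, reference):
    """A prediction is correct only if the precursor sets are identical."""
    return set(predicted) == set(reference)


# Toy reactions for illustration only (not taken from the paper's dataset).
reactions = [
    ("BaTiO3", ["BaCO3", "TiO2"]),
    ("SrTiO3", ["SrCO3", "TiO2"]),
    ("BaZrO3", ["BaCO3", "ZrO2"]),
    ("SrZrO3", ["SrCO3", "ZrO2"]),
    ("LiCoO2", ["Li2CO3", "Co3O4"]),  # both precursors appear only once
]

kept = filter_reactions(reactions, min_count=2)
kept_targets = [target for target, _ in kept]

order_insensitive = exact_match(["TiO2", "BaCO3"], ["BaCO3", "TiO2"])
partial_is_wrong = exact_match(["BaCO3"], ["BaCO3", "TiO2"])
```

Comparing sets rather than lists makes the score independent of precursor order, while a prediction that misses or adds even one precursor counts as wrong.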