BatGPT-Chem: A Foundation Large Model For Chemical Engineering

Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Yang Yang, Bao-Liang Lu, Hai Zhao
DOI: https://doi.org/10.26434/chemrxiv-2024-1p4xt
2024-04-17
Abstract: Large language models (LLMs) have showcased remarkable capabilities in the realm of AI for Science (AI4Sci), and chemistry has greatly benefited from the advancement of AI tools. With a strong capacity for learning sequential data like natural language, LLMs offer immense potential. Notably, common representations in chemistry, such as SMILES, are also in the form of sequences. Hence, we propose leveraging LLMs to comprehensively model both chemical sequences and natural language sequences, aiming to tackle diverse chemical tasks. To fulfill this objective, we introduce BatGPT-Chem, a foundational large-scale model with 15B parameters tailored for chemical engineering. First, we unify diverse tasks in chemistry by modeling them through a combination of natural language and SMILES. Next, leveraging this unified modeling approach, we craft prompt templates and generate instruction tuning data from a substantial volume of chemical data. Subsequently, we train BatGPT-15B on over a hundred million instances of instruction tuning data, empowering it to address tasks such as Molecule Description, Molecule Design, Retro-synthesis Prediction, Product Inference, and Yield Prediction. We release our trial platform at https://www.batgpt.net/dapp/chem.
Chemistry
What problem does this paper attempt to address?
The paper addresses how to apply large language models (LLMs) to chemical tasks within a single framework. The research team proposes BatGPT-Chem, a foundation model for chemical engineering with 15 billion parameters, which handles diverse chemical tasks by jointly modeling natural language and Simplified Molecular Input Line Entry System (SMILES) strings.

The authors proceed in three steps. First, they cast different chemical tasks (molecule description, molecule design, retrosynthesis prediction, product inference, and yield prediction) into one framework that combines natural language with SMILES. Next, they design prompt templates and use them to generate instruction tuning datasets from a large amount of open-source and private chemical data. Finally, they train the BatGPT-15B model on over 100 million instruction tuning instances so that it can perform these tasks.

The paper also details the methods used, including an introduction to SMILES notation, the unified modeling strategy, the design of the chemical tasks and their prompt templates, and the data sources used for training. The authors report experimental results on the retrosynthesis prediction task, evaluating the model in terms of coverage and effectiveness. In summary, the work proposes using a single LLM to model chemical sequences and natural language together as a way to tackle a range of problems in chemistry.
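To make the idea of combining natural language with SMILES more concrete, here is a minimal Python sketch of how one reaction record might be turned into instruction tuning examples. It is an illustration only: the template wording, the `TEMPLATES` dictionary, and the `split_reaction` and `build_examples` helpers are hypothetical and are not taken from the paper, whose actual prompts and data pipeline are not described here. The sketch assumes reactions are written as reaction SMILES in the form `reactants>agents>products`.

```python
# Hypothetical prompt templates; the paper's real template wording is not given here.
TEMPLATES = {
    "retrosynthesis": "Propose reactants that could be used to synthesize {product}.",
    "product_inference": "Given the reactants {reactants}, predict the product.",
}


def split_reaction(reaction_smiles):
    """Split a 'reactants>agents>products' reaction SMILES into (reactants, products)."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got {reaction_smiles!r}")
    reactants, _agents, products = parts
    return reactants, products


def build_examples(reaction_smiles):
    """Build one instruction/response pair per task from a single reaction."""
    reactants, product = split_reaction(reaction_smiles)
    return [
        {
            "task": "retrosynthesis",
            "instruction": TEMPLATES["retrosynthesis"].format(product=product),
            "response": reactants,
        },
        {
            "task": "product_inference",
            "instruction": TEMPLATES["product_inference"].format(reactants=reactants),
            "response": product,
        },
    ]


if __name__ == "__main__":
    # Esterification of acetic acid with ethanol to ethyl acetate (agents left empty).
    for example in build_examples("CC(=O)O.CCO>>CC(=O)OCC"):
        print(example)
```

The same reaction yields two training examples that differ only in which side of the reaction appears in the prompt and which in the response, which is one way a single sequence model could cover both retrosynthesis and product inference.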