BatGPT-Chem: A Foundation Large Model For Chemical Engineering

Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Yang Yang, Bao-Liang Lu, Hai Zhao

DOI: https://doi.org/10.26434/chemrxiv-2024-1p4xt

2024-04-17

Abstract: LLMs have showcased remarkable capabilities in the realm of AI for Science (AI4Sci), and chemistry has greatly benefited from the advancement of AI tools. With a strong capacity for learning sequential data like natural language, LLMs offer immense potential. Notably, common representations in chemistry, such as SMILES, are also in the form of sequences. Hence, we propose leveraging LLMs to comprehensively model both chemical sequences and natural language sequences, aiming to tackle diverse chemical tasks. To fulfill this objective, we introduce BatGPT-Chem, a foundational large-scale model with 15B parameters tailored for chemical engineering. First, we unify diverse tasks in chemistry by modeling them through a combination of natural language and SMILES. Next, leveraging this unified modeling approach, we craft prompt templates and generate instruction-tuning data from a substantial volume of chemical data. Subsequently, we train BatGPT-15B on over a hundred million instances of instruction-tuning data, empowering it to address tasks such as Molecule Description, Molecule Design, Retro-synthesis Prediction, Product Inference, and Yield Prediction. We release our trial platform at https://www.batgpt.net/dapp/chem.

Chemistry

What problem does this paper attempt to address?

The paper addresses how large language models (LLMs) can be applied to chemical tasks. The research team proposes BatGPT-Chem, a foundation model for chemical engineering with 15 billion parameters, which handles diverse chemical tasks by jointly modeling natural language and Simplified Molecular Input Line Entry System (SMILES) strings.

The authors first unify different chemical tasks (molecule description, molecule design, retrosynthesis prediction, product inference, and yield prediction) into a single framework that models each task as a combination of natural language and SMILES. Next, they design instruction-tuning prompt templates and use them to generate a large volume of instruction-tuning data from open-source and private chemical data. Finally, they train the BatGPT-15B model on over 100 million instruction-tuning instances, enabling it to perform these tasks.

The paper also describes its methods in detail: an introduction to SMILES notation, the unified modeling strategy, the design of the chemical tasks and their prompt templates, and the data sources used for training. The authors additionally report experimental results on the retrosynthesis prediction task, evaluating the model in terms of coverage and effectiveness. In summary, the work proposes using the sequence-modeling capabilities of LLMs to solve chemistry problems within one unified model.
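To make the idea of prompt templates over natural language and SMILES concrete, here is a minimal Python sketch. It is not the authors' implementation: the template wording, the task keys, the `InstructionExample` record, and the `build_example` helper are all hypothetical names chosen for illustration, and the paper's actual templates are not reproduced here. The sketch only shows how one SMILES string and one target answer could be turned into a prompt/response pair for instruction tuning.

```python
from dataclasses import dataclass

# Hypothetical templates, one per task named in the summary above.
# The wording is an assumption; the paper's real templates may differ.
TEMPLATES = {
    "molecule_description": "Describe the molecule with SMILES {input}.",
    "molecule_design": "Give the SMILES of a molecule matching this description: {input}",
    "retrosynthesis": "Propose reactants that could be used to synthesize the product {input}.",
    "product_inference": "Predict the product of the reactants {input}.",
    "yield_prediction": "Estimate the yield of the reaction {input}.",
}


@dataclass
class InstructionExample:
    """One instruction-tuning record: a task label, a prompt, and a target response."""
    task: str
    prompt: str
    response: str


def build_example(task: str, task_input: str, response: str) -> InstructionExample:
    """Fill the template for `task` with `task_input` and pair it with `response`."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    prompt = TEMPLATES[task].format(input=task_input)
    return InstructionExample(task=task, prompt=prompt, response=response)


if __name__ == "__main__":
    # Aspirin as the product; salicylic acid and acetic anhydride as reactants,
    # written as a single dot-separated SMILES string.
    example = build_example(
        "retrosynthesis",
        "CC(=O)Oc1ccccc1C(=O)O",
        "O=C(O)c1ccccc1O.CC(=O)OC(C)=O",
    )
    print(example.prompt)
    print(example.response)
```

Because every task reduces to a text prompt and a text response, the same record type covers tasks whose answer is a SMILES string (retrosynthesis, product inference, molecule design) and tasks whose answer is prose or a number (molecule description, yield prediction), which is the point of the unified modeling described above.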

BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

ChemDFM: A Large Language Foundation Model for Chemistry

Fine-tuning Large Language Models for Chemical Text Mining

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Structured Chemistry Reasoning with Large Language Models

Augmenting large language models with chemistry tools

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

ChemCrow: Augmenting large-language models with chemistry tools

Accelerated end-to-end chemical synthesis development with large language models

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Leveraging large language models for predictive chemistry

Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language

An Automatic End-to-end Chemical Synthesis Development Platform Powered by Large Language Models

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Large Language Models are Catalyzing Chemistry Education

Leveraging GPT-4 to transform chemistry from paper to practice