MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction

Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, Qiaoyu Tan
2024-10-18
Abstract: Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize to unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From the perspective of instruction tuning, we fine-tune large language models (LLMs) on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also outperforms state-of-the-art LLM baselines by up to a 15.7% increase in classification accuracy and a decrease of 17.9 in regression metrics (e.g., RMSE) under zero-shot. This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.
Quantitative Methods; Artificial Intelligence; Computational Engineering, Finance, and Science; Computation and Language; Machine Learning
What problem does this paper attempt to address?
This paper addresses several key challenges in molecular property prediction (MPP):

1. **High demand for labeled data**: Traditional MPP methods need a large amount of labeled molecular data for training, which is expensive and time-consuming to obtain in practice.
2. **Limited generalization ability**: Existing supervised methods are usually optimized for specific tasks, so they perform poorly on new or unseen tasks, which limits their use in open-world scenarios.
3. **Insufficient zero-shot and few-shot reasoning ability**: Existing methods perform poorly in zero-shot and few-shot settings and cannot make effective predictions from a small amount of labeled data.

To meet these challenges, the paper proposes MolecularGPT, a large language model (LLM) obtained through instruction tuning, with the following goals:

- **Few-shot molecular property prediction**: MolecularGPT adapts to new MPP tasks through zero-shot and few-shot in-context learning (ICL), without additional fine-tuning.
- **Structure-aware instruction design**: A structure-aware few-shot instruction design strategy uses the similarity between molecules to strengthen the model's reasoning.
- **Mixed instruction set**: Zero-shot and few-shot instructions are combined into a mixed instruction set that balances zero-shot and few-shot reasoning and improves overall performance across tasks.

Specifically, MolecularGPT adopts the following techniques:

- **SMILES representation**: SMILES strings serve as a unified representation of molecular graphs. They encode a molecule's chemical structure as a string of atomic symbols and bond symbols, so that different types of molecules can be fed to the model in a consistent way.
- **Structure-aware few-shot instructions**: For each query molecule, its nearest-neighbor molecules are retrieved and included as examples in the instruction, which helps the model understand and use molecular structure information.
- **Mixed instruction set**: Zero-shot and few-shot instructions are combined into a diverse instruction set covering more than 1,000 MPP tasks, including both classification and regression, to improve the model's generalization and in-context learning ability.

Experimental results show that MolecularGPT performs well on multiple molecular property prediction benchmarks, especially in few-shot and zero-shot settings. As reported in the abstract, with only two-shot examples it outperforms standard supervised GNN methods on 4 of 7 datasets, and in the zero-shot setting it exceeds state-of-the-art LLM baselines by up to 15.7% in classification accuracy and lowers regression error (e.g., RMSE) by up to 17.9.
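
The structure-aware few-shot instruction step and the mixed instruction set described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt template, the function names, and the example molecules and labels are hypothetical. The similarity measure is also an assumption, a Tanimoto score over character bigrams of the SMILES string, used here only as a dependency-free stand-in for a proper molecular fingerprint similarity.

```python
from typing import List, Set, Tuple

Labeled = Tuple[str, str]  # (SMILES string, property label as text)


def char_ngrams(smiles: str, n: int = 2) -> Set[str]:
    """Toy structural proxy (assumption): character n-grams of a SMILES string."""
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_neighbors(query: str, pool: List[Labeled], k: int) -> List[Labeled]:
    """Return the k labeled molecules in the pool most similar to the query."""
    q = char_ngrams(query)
    ranked = sorted(
        pool,
        key=lambda item: tanimoto(q, char_ngrams(item[0])),
        reverse=True,
    )
    return ranked[:k]


def build_instruction(task: str, query: str, pool: List[Labeled], k: int) -> str:
    """Build a zero-shot (k=0) or k-shot prompt; the template is hypothetical."""
    lines = [f"Task: {task}"]
    for smiles, label in retrieve_neighbors(query, pool, k):
        lines.append(f"SMILES: {smiles}\nAnswer: {label}")
    lines.append(f"SMILES: {query}\nAnswer:")
    return "\n".join(lines)


def mixed_instruction_set(
    task: str, queries: List[str], pool: List[Labeled], shots: Tuple[int, ...] = (0, 2)
) -> List[str]:
    """Combine zero-shot and few-shot prompts for every query molecule."""
    return [build_instruction(task, q, pool, k) for q in queries for k in shots]


# Hypothetical labeled pool for an invented yes/no property.
POOL: List[Labeled] = [
    ("CCO", "Yes"),
    ("CCCO", "Yes"),
    ("c1ccccc1", "No"),
    ("CC(=O)O", "No"),
]

if __name__ == "__main__":
    for prompt in mixed_instruction_set("Is the molecule an alcohol?", ["CCCCO"], POOL):
        print(prompt)
        print("---")
```

With `shots=(0, 2)`, each query molecule contributes one zero-shot prompt and one two-shot prompt whose demonstrations are its two nearest neighbors in the pool. In practice a fingerprint-based similarity from a cheminformatics toolkit would replace `char_ngrams`, since character bigrams capture only a crude notion of structure.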