Abstract: Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by their need for a large number of labeled molecules and by their restricted ability to generalize to new, unseen tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. Taking an instruction-tuning perspective, we fine-tune large language models (LLMs) on curated molecular instructions spanning over 1000 property prediction tasks. This yields a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning, through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. In the zero-shot setting, it also outperforms state-of-the-art LLM baselines by up to a 15.7% increase in classification accuracy and a decrease of 17.9 in regression metrics (e.g., RMSE). This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.

What problem does this paper attempt to address?

This paper addresses three key challenges in molecular property prediction (MPP):

1. **High demand for labeled data**: Traditional MPP methods need large amounts of labeled molecular data for training, which is expensive and time-consuming to obtain in practice.
2. **Limited generalization ability**: Existing supervised methods are usually optimized for specific tasks, so they perform poorly on new or unseen tasks, which limits their use in open-world scenarios.
3. **Insufficient zero-shot and few-shot reasoning ability**: Existing methods perform poorly in zero-shot and few-shot settings and cannot make effective use of a small number of labeled examples.

To meet these challenges, the paper proposes MolecularGPT, a large language model (LLM) built through instruction tuning, with the following goals:

- **Few-shot molecular property prediction**: MolecularGPT adapts to new MPP tasks through zero-shot and few-shot in-context learning (ICL), without additional fine-tuning.
- **Structure-aware instruction design**: A structure-aware few-shot instruction design strategy uses the similarity between molecules to strengthen the model's reasoning.
- **Mixed instruction set**: Zero-shot and few-shot instructions are combined into a mixed instruction set that balances zero-shot and few-shot reasoning and improves overall performance across tasks.

Specifically, MolecularGPT adopts the following techniques:

- **SMILES representation**: SMILES strings serve as a unified representation of molecular graphs. Each molecule's chemical structure is converted into a string of atomic symbols and bond symbols, so that different types of molecules can be fed to the model in a consistent way.
- **Structure-aware few-shot instructions**: For each query molecule, its nearest-neighbor molecules are retrieved and included as examples in the instruction, which helps the model understand and exploit molecular structure information.
- **Mixed instruction set**: Zero-shot and few-shot instructions are combined into a diverse instruction set covering more than 1,000 MPP tasks, including both classification and regression tasks, to improve the model's generalization and in-context learning ability.

Experimental results show that MolecularGPT performs well across 10 downstream evaluation datasets, especially in zero-shot and few-shot settings. With just two-shot examples it outperforms standard supervised GNN methods on 4 of 7 datasets, and in the zero-shot setting it improves on state-of-the-art LLM baselines by up to 15.7% in classification accuracy and by 17.9 in regression metrics such as RMSE.
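The structure-aware few-shot instructions and the mixed instruction set described above can be illustrated with a small sketch. This is not the paper's implementation: the similarity measure here is a Tanimoto (Jaccard) score over character bigrams of the SMILES string, chosen only so the example runs with the standard library, whereas a real pipeline would more likely compare chemical fingerprints. The function names, the prompt template, and the toy molecules and labels are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical labeled pool: (SMILES, answer) pairs, toy data for illustration only.
Example = Tuple[str, str]

def ngrams(smiles: str, n: int = 2) -> Set[str]:
    """Character n-grams of a SMILES string (a crude stand-in for a fingerprint)."""
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def tanimoto(a: str, b: str) -> float:
    """Tanimoto (Jaccard) similarity between the n-gram sets of two SMILES strings."""
    sa, sb = ngrams(a), ngrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve_neighbors(query: str, pool: List[Example], k: int) -> List[Example]:
    """Return the k labeled molecules most similar to the query, most similar first."""
    ranked = sorted(pool, key=lambda item: tanimoto(query, item[0]), reverse=True)
    return ranked[:k]

def build_instruction(task: str, query: str, pool: List[Example], k: int) -> str:
    """Build a k-shot instruction; k = 0 yields a zero-shot instruction."""
    parts = [task]
    for smiles, label in retrieve_neighbors(query, pool, k):
        parts.append(f"SMILES: {smiles}\nAnswer: {label}")
    parts.append(f"SMILES: {query}\nAnswer:")
    return "\n\n".join(parts)

def build_mixed_set(task: str, query: str, pool: List[Example],
                    shots: Tuple[int, ...] = (0, 2)) -> List[str]:
    """Mix zero-shot and few-shot instructions for the same query molecule."""
    return [build_instruction(task, query, pool, k) for k in shots]

# Toy task and pool (assumed for the example, not taken from the paper).
task = "Is this molecule soluble in water? Answer Yes or No."
pool = [("CCO", "Yes"), ("CCCO", "Yes"), ("c1ccccc1", "No"), ("CCCCCCCC", "No")]
query = "CCCCO"

mixed = build_mixed_set(task, query, pool)
print(mixed[1])  # the two-shot instruction with retrieved neighbors
```

Sorting the pool by similarity to the query is what makes the demonstrations structure-aware in this sketch; swapping `tanimoto` for a fingerprint-based similarity would leave the rest of the code unchanged.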

MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction
