Can Large Language Models Empower Molecular Property Prediction?

Chen Qian, Huayi Tang, Zhirui Yang, Hong Liang, Yong Liu
2023-07-15
Abstract: Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule can be represented either as graph-structured data or as SMILES text. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to utilize LLMs to assist in understanding molecules represented by SMILES, the exploration of how LLMs will impact molecular property prediction is still in its early stage. In this work, we advance towards this objective from two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules. Specifically, we first prompt LLMs to perform in-context molecular classification and evaluate their performance. We then employ LLMs to generate semantically enriched explanations for the original SMILES and use these explanations to fine-tune a small-scale LM for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets, and confirm the immense potential of LLMs in molecular property prediction tasks. Code is available at https://github.com/ChnQ/LLM4Mol.
Machine Learning, Artificial Intelligence, Quantitative Methods
What problem does this paper attempt to address?
The paper explores whether large language models (LLMs) can help with molecular property prediction. The authors investigate this from two perspectives:

1. **Zero-shot/Few-shot Molecular Classification**: Relying on the in-context learning ability of LLMs, suitably designed prompts let the model classify molecules directly, without any parameter updates.
2. **Generating New Molecular Representations**: An LLM generates a detailed textual description for each molecule's SMILES (Simplified Molecular Input Line Entry System) string, covering information such as its functional groups and chemical properties. These descriptions, referred to as "Caption as new Representation" (CaR), then serve as new molecular representations for fine-tuning a small-scale language model on downstream tasks.

Experimental results show that, on multiple benchmark datasets under random split settings, CaR performs significantly better than traditional Graph Neural Networks (GNNs) and SMILES-based methods. The paper also discusses limitations and future research directions: exploring a more diverse set of LLMs, making better use of the graph structure of molecules, and handling macromolecules that cannot be represented by SMILES.
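
Both perspectives described in this summary start from a prompt built around a SMILES string. The following is a minimal sketch of how such prompts might be assembled. It is not the authors' code: the function names, the prompt wording, the placeholder task "property P", and the Yes/No labels on the example molecules are all assumptions made for illustration, and the call to an actual LLM is left out.

```python
from typing import List, Tuple


def build_classification_prompt(
    task: str,
    examples: List[Tuple[str, str]],
    query_smiles: str,
) -> str:
    """Assemble an in-context classification prompt (hypothetical wording).

    With an empty `examples` list the prompt is zero-shot; with labeled
    (SMILES, answer) pairs it becomes few-shot.
    """
    lines = [f"Task: {task}", "Answer with 'Yes' or 'No' only.", ""]
    for smiles, label in examples:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
        lines.append("")
    # The query molecule comes last, with the answer left open for the LLM.
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")
    return "\n".join(lines)


def build_caption_prompt(smiles: str) -> str:
    """Ask for a textual description to use as a molecular representation.

    The wording is an assumption; it only mirrors the kind of content the
    summary mentions (functional groups and chemical properties).
    """
    return (
        "Describe the molecule given by the following SMILES string, "
        "including its functional groups and chemical properties.\n"
        f"SMILES: {smiles}"
    )


if __name__ == "__main__":
    # Illustrative labels only; they are NOT real measurements.
    demo_examples = [("CCO", "No"), ("c1ccccc1", "Yes")]
    print(build_classification_prompt(
        "Decide whether the molecule has property P.",
        demo_examples,
        "CC(=O)O",
    ))
    print()
    print(build_caption_prompt("CC(=O)Oc1ccccc1C(=O)O"))
```

In the second perspective, the text returned for the caption prompt (rather than the raw SMILES string) would be the input used to fine-tune the small-scale language model on each downstream task.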