Can Large Language Models Understand Molecules?

Shaghayegh Sadeghi, Alan Bui, Ali Forooghi, Jianguo Lu, Alioune Ngom
2024-05-21
Abstract: Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations.
Biomolecules, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper investigates how well large language models (LLMs) such as GPT and LLaMA understand molecules, by using them to generate embeddings from SMILES strings. The study compares these embeddings with those of models pre-trained specifically on SMILES, using two downstream tasks: molecular property prediction and drug-drug interaction prediction. The results indicate that SMILES embeddings generated by LLaMA outperform those generated by GPT in these tasks and perform comparably to the SMILES-specific pre-trained models. The research also finds that newer versions of LLMs generally perform better than older versions, despite being trained on more general tasks, while LLaMA and LLaMA2 produce embeddings of similar quality. Overall, the study emphasizes the potential of LLMs for molecular representation and encourages further exploration of these models in chemical tasks.
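
To make the evaluation setup concrete, the following is a minimal sketch of the general "embed SMILES, then predict" pipeline, not the authors' implementation. Everything in it is an assumption made for illustration: `toy_token_vector` is a hypothetical stand-in for the hidden state an LLM would produce for a token, mean pooling is one common way to obtain a fixed-size embedding, concatenating two embeddings is one simple way to represent a drug pair, and the nearest-neighbour predictor is a placeholder for whatever downstream model is trained on the embeddings.

```python
import hashlib
import math

DIM = 16  # toy embedding width; a real LLM hidden state is far wider


def toy_token_vector(token: str) -> list[float]:
    """Hypothetical stand-in for an LLM hidden state.

    Maps a token to a deterministic pseudo-random vector in [-1, 1].
    It carries no chemical knowledge; it only makes the sketch runnable.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(byte / 255.0) * 2.0 - 1.0 for byte in digest[:DIM]]


def embed_smiles(smiles: str) -> list[float]:
    """Mean-pool per-character vectors into one fixed-size embedding (assumed pooling)."""
    vectors = [toy_token_vector(ch) for ch in smiles]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pair_features(smiles_a: str, smiles_b: str) -> list[float]:
    """Represent a drug pair by concatenating both embeddings (assumed design)."""
    return embed_smiles(smiles_a) + embed_smiles(smiles_b)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_label(query: str, labeled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbour placeholder for the downstream property predictor."""
    query_vec = embed_smiles(query)
    best = max(labeled, key=lambda item: cosine(query_vec, embed_smiles(item[0])))
    return best[1]


if __name__ == "__main__":
    # The labels below are made-up placeholders, not data from the paper.
    train = [("CCO", "label_a"), ("c1ccccc1", "label_b")]
    print(len(embed_smiles("CCO")))              # embedding width
    print(len(pair_features("CCO", "CC(=O)O")))  # pair feature width
    print(predict_label("CCO", train))
```

In an actual experiment, `toy_token_vector` would be replaced by embeddings taken from GPT, LLaMA, or a SMILES-specific pre-trained model, and the comparison described above amounts to swapping that one component while keeping the downstream predictor and data fixed.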