GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction

Suryanarayanan Balaji, Rishikesh Magar, Yayati Jadhav, Amir Barati Farimani
2023-10-11
Abstract: With the emergence of Transformer architectures and their powerful understanding of textual data, a new horizon has opened up for predicting molecular properties from text descriptions. While SMILES strings are the most common form of representation, they lack robustness, rich information, and canonicity, which limits their effectiveness as generalizable representations. Here, we present GPT-MolBERTa, a self-supervised large language model (LLM) that uses detailed textual descriptions of molecules to predict their properties. Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn representations of molecules. To predict properties in the downstream tasks, both BERT and RoBERTa models were used in the finetuning stage. Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks and approaches state-of-the-art performance in regression tasks. Additionally, further analysis of the attention mechanisms shows that GPT-MolBERTa is able to pick up important information from the input textual data, demonstrating the interpretability of the model.
Chemical Physics, Machine Learning
What problem does this paper attempt to address?
The paper proposes a new method for predicting molecular properties: represent each molecule by a detailed textual description, then train a large language model on those descriptions. The resulting model, GPT-MolBERTa, is a self-supervised large language model (LLM) that predicts molecular properties from text rather than from SMILES strings alone.

To build the training corpus, the authors used ChatGPT to generate descriptions for a dataset of approximately 326,000 molecules. The descriptions cover the functional groups, shapes, and chemical properties of each molecule and are used to pre-train a model based on the RoBERTa architecture. For the downstream molecular property prediction tasks, the pre-trained model is fine-tuned with an added regression or classification head.

Experimental results show that GPT-MolBERTa performs well on multiple benchmark datasets and approaches state-of-the-art performance on regression tasks. Notably, while some competing models are pre-trained on millions of molecules, GPT-MolBERTa was pre-trained on only about 326,000, which suggests that pre-training on richer textual descriptions has the potential to improve molecular property prediction. The paper also examines the model's interpretability through its attention mechanisms and compares its performance across tasks. Overall, the work demonstrates the effectiveness and potential value of using textual descriptions to represent molecules and predict their properties.
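The self-supervised pre-training step can be illustrated with a minimal sketch of masked language modeling, the objective commonly used to train RoBERTa-style models. This summary does not state the paper's exact masking settings, so the 15% selection rate and the 80/10/10 replacement split below are assumptions taken from the standard BERT/RoBERTa recipe, and the example description, the whitespace tokenization, and the `mask_tokens` helper are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # mask token string used by RoBERTa tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    Assumed recipe (not confirmed by the source): each token is selected
    with probability mask_prob; a selected token becomes <mask> 80% of
    the time, a random vocabulary token 10%, and stays unchanged 10%.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise keep the original token in place
    return corrupted, labels


# Hypothetical description text; a real pipeline would use a subword tokenizer.
description = (
    "The molecule contains a hydroxyl group attached to an aromatic ring"
).split()
vocab = sorted(set(description))
corrupted, labels = mask_tokens(description, vocab, seed=0)
```

During pre-training the model sees `corrupted` and is trained to recover the tokens recorded in `labels`. Fine-tuning then replaces this objective with a regression or classification head on top of the same encoder.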