Abstract: Binding affinity and molecular property prediction are crucial for drug discovery. Deep learning models have been widely used for these tasks; however, they often need large datasets to achieve strong performance. Pre-training models on vast unlabelled data has emerged as a way to extract contextualised embeddings that boost performance on smaller datasets. SMILES (Simplified Molecular Input Line Entry System) encodes molecular structures as strings, making them suitable for natural language processing (NLP). Transformers, known for capturing long-range dependencies, are well suited to processing SMILES. One such Transformer-based architecture is BERT (Bidirectional Encoder Representations from Transformers), which uses only the encoder part of the Transformer and can perform classification and regression tasks. Pre-trained Transformer-based architectures using SMILES have significantly improved predictions on smaller datasets. Public data repositories such as PubChem, which provide SMILES and physicochemical properties among other data, are essential for pre-training these models. SMILES embeddings that combine chemical structure and physicochemical property information could further enhance performance on tasks such as binding affinity prediction. For this reason, we introduce Smile-to-Bert, a pre-trained BERT architecture that predicts seven physicochemical properties from SMILES using PubChem data and two different SMILES tokenizers. Moreover, this model generates embeddings that integrate information about molecular structure and physicochemical properties. For the prediction of physicochemical properties, the mean absolute errors obtained are: H-bond acceptors (0.0502), H-bond donors (0.0048), rotatable bonds (0.0949), exact mass (0.5678), TPSA (0.4961), heavy atom count (0.0345), and log-P (0.2219).
Additionally, the usefulness of the generated embeddings is evaluated on two binding affinity datasets, and their performance is compared with that of embeddings created by a state-of-the-art Transformer. We show that the SmilesPE tokenizer works better than the atom-level one and that combining the embeddings generated by Smile-to-Bert with the state-of-the-art Transformer embeddings improves binding affinity prediction on one of the datasets. A dashboard for the prediction of physicochemical properties is available at http://147.83.252.32:8050/, and the code is accessible at https://github.com/m-baralt/smile-to-bert.
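To illustrate what atom-level tokenization of a SMILES string can look like, the following is a minimal sketch in Python. The regular expression and the function name `tokenize_atom_level` are assumptions made for illustration; they are not taken from the Smile-to-Bert code, and the pattern covers only a common subset of SMILES syntax. The SmilesPE tokenizer, by contrast, merges frequent substrings into larger tokens and is not reproduced here.

```python
import re

# Assumed atom-level pattern (hypothetical, not the paper's tokenizer):
# bracket atoms, two-letter halogens, single-letter organic-subset atoms,
# aromatic atoms, two-digit ring closures, single digits, and
# bond/branch/stereo symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@.*$])"
)


def tokenize_atom_level(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if any character is not covered by the pattern,
    so unsupported input is rejected rather than silently truncated.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unsupported characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin: each atom, bond, branch, and ring-closure digit is one token.
    print(tokenize_atom_level("CC(=O)Oc1ccccc1C(=O)O"))
```

With this pattern, two-letter elements such as `Cl` and `Br` and bracket atoms such as `[NH4+]` stay together as single tokens, which is the main difference from naive character-level splitting.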

Smile-to-Bert: A BERT architecture trained for physicochemical properties prediction and SMILES embeddings generation