Abstract: Binding affinity and molecular property prediction are crucial for drug discovery. Deep learning models have been widely used for these tasks; however, large datasets are often needed to achieve strong performance. Pre-training models on vast amounts of unlabelled data has emerged as a way to extract contextualised embeddings that boost performance on smaller datasets. SMILES (Simplified Molecular Input Line Entry System) encodes molecular structures as strings, which makes them suitable for natural language processing (NLP). Transformers, known for capturing long-range dependencies, are well suited to processing SMILES. One such Transformer-based architecture is BERT (Bidirectional Encoder Representations from Transformers), which uses only the encoder part of the Transformer and can perform classification and regression tasks. Pre-trained Transformer-based architectures using SMILES have significantly improved predictions on smaller datasets. Public data repositories such as PubChem, which provide SMILES and physicochemical properties among other data, are essential for pre-training these models. SMILES embeddings that combine chemical structure and physicochemical property information could further enhance performance on tasks such as binding affinity prediction. For this reason, we introduce Smile-to-Bert, a pre-trained BERT architecture that predicts seven physicochemical properties from SMILES using PubChem data and two different SMILES tokenizers. The model also generates embeddings that integrate information about molecular structure and physicochemical properties. For the prediction of physicochemical properties, the mean absolute errors obtained are: H-bond acceptors (0.0502), H-bond donors (0.0048), rotatable bonds (0.0949), exact mass (0.5678), TPSA (0.4961), heavy atom count (0.0345), and log-P (0.2219). Additionally, the usefulness of the generated embeddings is evaluated on two binding affinity datasets, and their performance is compared with that of embeddings created by a state-of-the-art Transformer. We show that the SmilesPE tokenizer works better than the atom-level one, and that combining the embeddings generated by Smile-to-Bert with the state-of-the-art Transformer embeddings improves the prediction of binding affinity on one of the datasets. A dashboard for the prediction of physicochemical properties is available at http://147.83.252.32:8050/, and the code is accessible at https://github.com/m-baralt/smile-to-bert.
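
To make the idea of an atom-level SMILES tokenizer concrete, the following is a minimal sketch in Python. It is not the tokenizer used by Smile-to-Bert; the regular expression, the name `SMILES_TOKEN_PATTERN`, and the function `tokenize_atom_level` are illustrative assumptions based on the common practice of splitting a SMILES string into bracket atoms, two-letter halogens, single-letter atoms, bonds, branches, and ring-closure digits. The SmilesPE tokenizer, by contrast, relies on a learned vocabulary of frequent substructures and is not reproduced here.

```python
import re
from typing import List

# Hypothetical pattern for illustration only: bracket atoms first, then
# two-letter halogens, two-digit ring closures, single-letter atoms,
# bond/branch symbols, and single-digit ring closures.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[=#$:/\\\-+().~*@]|\d)"
)


def tokenize_atom_level(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if some character is not covered by the pattern,
    so that unsupported input is not silently dropped.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin, written as a SMILES string.
    print(tokenize_atom_level("CC(=O)Oc1ccccc1C(=O)O"))
```

In a BERT-style pipeline, each token would then be mapped to an integer index from a vocabulary before being passed to the encoder. The reconstruction check in the sketch is a design choice that surfaces elements outside the assumed pattern instead of ignoring them.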

Smile-to-Bert: A BERT architecture trained for physicochemical properties prediction and SMILES embeddings generation
