Abstract: Binding affinity and molecular property prediction are crucial for drug discovery. Deep learning models have been widely used for these tasks; however, they often need large datasets to achieve strong performance. Pre-training models on vast amounts of unlabelled data has emerged as a way to extract contextualised embeddings that boost performance on smaller datasets. SMILES (Simplified Molecular Input Line Entry System) encodes molecular structures as strings, making them suitable for natural language processing (NLP). Transformers, known for capturing long-range dependencies, are well suited to processing SMILES. One such Transformer-based architecture is BERT (Bidirectional Encoder Representations from Transformers), which uses only the encoder part of the Transformer and can perform classification and regression tasks. Pre-trained Transformer-based architectures using SMILES have significantly improved predictions on smaller datasets. Public data repositories such as PubChem, which provide SMILES and physicochemical properties among other data, are essential for pre-training these models. SMILES embeddings that combine chemical structure and physicochemical property information could further enhance performance on tasks such as binding affinity prediction. For this reason, we introduce Smile-to-Bert, a BERT architecture pre-trained to predict seven physicochemical properties from SMILES using PubChem data and two different SMILES tokenizers. The model also generates embeddings that integrate information about molecular structure and physicochemical properties. For the prediction of physicochemical properties, the mean absolute errors obtained are: H-bond acceptors (0.0502), H-bond donors (0.0048), rotatable bonds (0.0949), exact mass (0.5678), TPSA (0.4961), heavy atom count (0.0345), and log-P (0.2219). Additionally, we evaluate the usefulness of the generated embeddings on two binding affinity datasets and compare their performance with that of embeddings created by a state-of-the-art Transformer. We show that the SmilesPE tokenizer works better than the atom-level one and that combining the embeddings generated by Smile-to-Bert with the state-of-the-art Transformer embeddings improves binding affinity prediction in one of the datasets. A dashboard for the prediction of physicochemical properties is available at http://147.83.252.32:8050/, and the code is accessible at https://github.com/m-baralt/smile-to-bert.
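
To make the idea of an atom-level SMILES tokenizer concrete, the following is a minimal sketch in Python. It relies on a regular expression that is commonly used for atom-level SMILES tokenization; this pattern, the function name `tokenize_smiles`, and the error handling are assumptions made for illustration and are not taken from the Smile-to-Bert code, whose tokenizers may differ in detail.

```python
import re
from typing import List

# Assumption: a regular expression commonly used for atom-level SMILES
# tokenization. It is illustrative only and is not necessarily the pattern
# used by Smile-to-Bert.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper).

    Bracket atoms such as [NH4+] are kept as single tokens, and the
    two-letter halogens Cl and Br are not split into separate letters.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # If the tokens do not reassemble into the input, some character was
    # not covered by the pattern, so the string is rejected.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin, written as a SMILES string.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Each resulting token would then be mapped to a vocabulary index before being passed to the encoder. A substructure-based tokenizer such as SmilesPE instead merges frequent character sequences into larger tokens, which shortens the input sequence.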

MolRoPE-BERT: an Enhanced Molecular Representation with Rotary Position Embedding for Molecular Property Prediction

Knowledge-based BERT: a Method to Extract Molecular Features Like Computational Chemists

SMG-BERT: integrating stereoscopic information and chemical representation for molecular property prediction

MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction

Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration

MolPROP: Molecular Property prediction with multimodal language and graph fusion

FG-BERT: a Generalized and Self-Supervised Functional Group-Based Molecular Representation Learning Framework for Properties Prediction.

DMPNN-Bert: a deep learning architecture for molecular property prediction.

A merged molecular representation learning for molecular properties prediction with a web-based service

Geometry-based BERT: an experimentally validated deep learning model for molecular property prediction in drug discovery

Smile-to-Bert: A BERT architecture trained for physicochemical properties prediction and SMILES embeddings generation

MvMRL: a multi-view molecular representation learning method for molecular property prediction

Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

Understanding the Limitations of Deep Models for Molecular Property Prediction: Insights and Solutions.

Ensemble Model With Bert,Roberta and Xlnet For Molecular property prediction

A Novel Molecular Representation Learning for Molecular Property Prediction with a Multiple SMILES-Based Augmentation

A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction

Noisemol: A Noise-Robusted Data Augmentation Via Perturbing Noise for Molecular Property Prediction