Abstract: Binding affinity and molecular property prediction are crucial for drug discovery. Deep learning models have been widely used for these tasks; however, they often need large datasets to achieve strong performance. Pre-training models on vast amounts of unlabelled data has emerged as a way to extract contextualised embeddings that boost performance on smaller datasets. SMILES (Simplified Molecular Input Line Entry System) encodes molecular structures as strings, which makes them suitable for natural language processing (NLP). Transformers, known for capturing long-range dependencies, are well suited to processing SMILES. One such Transformer-based architecture is BERT (Bidirectional Encoder Representations from Transformers), which uses only the encoder part of the Transformer and can perform classification and regression tasks. Pre-trained Transformer-based architectures that use SMILES have significantly improved predictions on smaller datasets. Public data repositories such as PubChem, which provide SMILES and physicochemical properties among other data, are essential for pre-training these models. SMILES embeddings that combine information on chemical structure and physicochemical properties could further enhance performance on tasks such as binding affinity prediction. For this reason, we introduce Smile-to-Bert, a pre-trained BERT architecture that predicts seven physicochemical properties from SMILES using PubChem data and two different SMILES tokenizers. This model also generates embeddings that integrate information about molecular structure and physicochemical properties. For the prediction of physicochemical properties, the mean absolute errors obtained are: H-bond acceptors (0.0502), H-bond donors (0.0048), rotatable bonds (0.0949), exact mass (0.5678), TPSA (0.4961), heavy atom count (0.0345), and log-P (0.2219). Additionally, the usefulness of the generated embeddings is evaluated on two binding affinity datasets, and their performance is compared with that of embeddings created by a state-of-the-art Transformer. We show that the SmilesPE tokenizer works better than the atom-level one, and that combining the embeddings generated by Smile-to-Bert with the state-of-the-art Transformer embeddings improves binding affinity prediction on one of the datasets. A dashboard for the prediction of physicochemical properties is available at http://147.83.252.32:8050/, and the code is accessible at https://github.com/m-baralt/smile-to-bert.
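To make two ideas from the abstract concrete, the following is a minimal Python sketch of atom-level SMILES tokenization and of the mean absolute error metric used to report the property predictions. It is an illustration only and not the authors' implementation: the regular expression is a pattern commonly used for atom-level SMILES tokenization, and the names `SMILES_TOKEN_PATTERN`, `atom_level_tokenize`, and `mean_absolute_error` are hypothetical. The SmilesPE tokenizer, which merges frequent substrings into larger tokens, is not reproduced here.

```python
import re
from typing import List, Sequence

# Assumption: a regex commonly used for atom-level SMILES tokenization.
# The tokenizer in the Smile-to-Bert repository may differ in detail.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atom_level_tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES string: {smiles!r}")
    return tokens


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |true - predicted| over all samples (hypothetical helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("Inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Aspirin: each atom, bond symbol, bracket and ring digit becomes a token.
    print(atom_level_tokenize("CC(=O)Oc1ccccc1C(=O)O"))
    # Toy example: predicted vs. true H-bond donor counts for three molecules.
    print(mean_absolute_error([2, 0, 5], [2, 1, 5]))
```

In this sketch, two-letter elements such as `Cl` and `Br` and bracketed atoms such as `[NH4+]` are kept as single tokens, which is the main difference from splitting the string character by character.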

Smile-to-Bert: A BERT architecture trained for physicochemical properties prediction and SMILES embeddings generation

Knowledge-based BERT: a Method to Extract Molecular Features Like Computational Chemists

Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

MolRoPE-BERT: an Enhanced Molecular Representation with Rotary Position Embedding for Molecular Property Prediction

A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

SMG-BERT: integrating stereoscopic information and chemical representation for molecular property prediction

ChemBERTa-2: Towards Chemical Foundation Models

Transformer-CNN: Fast and Reliable tool for QSAR

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules

MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction

SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties

PeptideBERT: A Language Model based on Transformers for Peptide Property Prediction

Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery

GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction

Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation