Abstract: The n-octanol/buffer solution distribution coefficient at pH 7.4 (log D7.4) is an indicator of lipophilicity that influences a wide variety of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties and the druggability of compounds. In log D7.4 prediction, graph neural networks (GNNs) can uncover subtle structure-property relationships (SPRs) by automatically extracting features from molecular graphs, but their performance is often limited by the small size of available datasets. Herein, we present a transfer learning strategy called pretraining on computational data and then fine-tuning on experimental data (PCFE) to fully exploit the predictive potential of GNNs. PCFE works by pretraining a GNN model on 1.71 million computational log D data points (low-fidelity data) and then fine-tuning it on 19,155 experimental log D7.4 data points (high-fidelity data). Experiments on three GNN architectures (graph convolutional network (GCN), graph attention network (GAT), and Attentive FP) demonstrated the effectiveness of PCFE in improving GNNs for log D7.4 prediction. Moreover, the optimal PCFE-trained GNN model (cx-Attentive FP, test R² = 0.909) outperformed four strong descriptor-based models (random forest (RF), gradient boosting (GB), support vector machine (SVM), and extreme gradient boosting (XGBoost)). The robustness of the cx-Attentive FP model was also confirmed by evaluating it with different training data sizes and dataset splitting strategies. We therefore developed a webserver and defined the applicability domain for this model. The webserver (http://tools.scbdd.com/chemlogd/) provides free log D7.4 prediction services. In addition, the important descriptors for log D7.4 were identified by the Shapley additive explanations (SHAP) method, and the substructures most relevant to log D7.4 were identified by the attention mechanism. Finally, matched molecular pair analysis (MMPA) was performed to summarize the contributions of common chemical substituents to log D7.4, including a variety of hydrocarbon groups, halogen groups, heteroatoms, and polar groups. In conclusion, we believe that the cx-Attentive FP model can serve as a reliable tool for predicting log D7.4, and we hope that pretraining on low-fidelity data can help GNNs make accurate predictions of other endpoints in drug discovery.
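The pretrain-then-fine-tune idea behind PCFE can be illustrated with a toy sketch. This is not the authors' implementation: a three-parameter linear model stands in for the GNN, and the two synthetic relationships (`TRUE` for "experimental" data, `APPROX` for biased "computational" data), the sample sizes, the learning rate, and the epoch counts are all hypothetical choices made for illustration. It shows only that weights pretrained on abundant, slightly biased low-fidelity data give a better starting point for a short fine-tuning run on scarce high-fidelity data than a fresh initialization does.

```python
import random

def predict(w, x):
    """Linear model with two features and a bias term w[2]."""
    return w[0] * x[0] + w[1] * x[1] + w[2]

def mse(w, data):
    """Mean squared error of weights w on a list of (x, y) pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr, epochs):
    """Full-batch gradient descent on MSE, starting from the given weights."""
    w = list(w)
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += 2 * err * x[0] / n
            grad[1] += 2 * err * x[1] / n
            grad[2] += 2 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def make_data(rng, n, coef, noise):
    """Sample n points from y = coef[0]*x0 + coef[1]*x1 + coef[2] + noise."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        y = coef[0] * x[0] + coef[1] * x[1] + coef[2] + rng.gauss(0, noise)
        data.append((x, y))
    return data

rng = random.Random(0)
TRUE = (2.0, -1.0, 0.5)    # hypothetical high-fidelity ("experimental") relation
APPROX = (1.8, -0.9, 0.2)  # hypothetical biased low-fidelity ("computational") relation

low_fidelity = make_data(rng, 2000, APPROX, 0.10)  # abundant, biased, noisier
high_train = make_data(rng, 20, TRUE, 0.05)        # scarce, accurate
high_test = make_data(rng, 200, TRUE, 0.05)        # held-out evaluation set

zeros = [0.0, 0.0, 0.0]
pretrained = train(zeros, low_fidelity, lr=0.1, epochs=200)   # step 1: pretrain
finetuned = train(pretrained, high_train, lr=0.1, epochs=20)  # step 2: fine-tune
scratch = train(zeros, high_train, lr=0.1, epochs=20)         # baseline: no pretraining

for name, w in [("pretrained only", pretrained),
                ("scratch", scratch),
                ("pretrained + fine-tuned", finetuned)]:
    print(f"{name:>24}: test MSE = {mse(w, high_test):.4f}")
```

In this toy setting the scratch baseline is limited by the same short training budget as the fine-tuning run, so the comparison reflects only the head start that pretraining provides. For a GNN trained on real molecules, the benefit reported in the abstract comes from features learned on the much larger low-fidelity dataset, which a linear model cannot capture.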
