Interpretable deep-learning pKa prediction for small molecule drugs via atomic sensitivity analysis

Joseph DeCorte, Benjamin Brown, Jens Meiler
DOI: https://doi.org/10.26434/chemrxiv-2024-hr692
2024-06-12
Abstract: Machine learning (ML) models play a crucial role in predicting properties essential to drug development, such as a drug’s log-scale acid-dissociation constant (pKa). Despite recent architectural advances, these models often generalize poorly to novel compounds due to a scarcity of ground-truth data. Further, these models lack interpretability, in part due to a dependence on explicit encodings of input molecules’ molecular substructures. To this end, atomic-resolution information is accessible in chemical structures by observing model response to atomic perturbations of an input molecule; however, no methods exist that systematically utilize this information for model and molecular analysis. Here, we present BCL-XpKa, a substructure-independent, deep neural network (DNN)-based pKa predictor that generalizes well to novel small molecules. BCL-XpKa discretizes pKa prediction from a regression problem into a multitask-classification problem, which accumulates data for prediction at biologically relevant pH values and records the model’s uncertainty in its prediction as a discrete distribution for each pKa prediction. BCL-XpKa outperforms modern ML pKa predictors and accurately models the effects of common molecular modifications on a molecule’s ionizability. We then leverage BCL-XpKa’s substructure independence to introduce atomic sensitivity analysis (ASA), which quickly decomposes a molecule’s predicted pKa value into its respective atomic contributions without model retraining. When paired with BCL-XpKa, ASA informs that BCL-XpKa has implicitly learned high-resolution information about molecular substructures. We further demonstrate ASA’s utility in structure preparation for protein-ligand docking by identifying ionization sites in 97.8% and 83.4% of complex small molecule acids and bases. We then apply ASA with BCL-XpKa to understand the physicochemical liabilities and guide optimization of a recently published KRAS-degrading PROTAC.
Chemistry
What problem does this paper attempt to address?
The paper aims to make prediction of the acid dissociation constant (pKₐ) of small-molecule drugs more accurate, more robust on novel compounds, and more interpretable. Specifically:
1. **Improve prediction accuracy**: The paper develops a multitask classifier, BCL-XpKa, that recasts pKₐ prediction from a regression problem into a multitask-classification problem. According to the abstract, this concentrates training data at biologically relevant pH values and reports the model's uncertainty as a discrete distribution for each prediction.
2. **Enhance model generalization ability**: Existing machine-learning predictors often perform poorly on novel compounds. The abstract attributes this to scarce ground-truth data, and the summary adds that reliance on explicitly encoded molecular substructure features limits adaptability. BCL-XpKa is substructure-independent: it uses local atomic-environment embeddings rather than specific molecular substructures.
3. **Improve model interpretability**: The paper introduces atomic sensitivity analysis (ASA), which perturbs the input molecule atom by atom and observes the model's response. This decomposes a molecule's predicted pKₐ into per-atom contributions without retraining the model.
4. **Application examples**: The paper applies BCL-XpKa and ASA to practical drug-design tasks. ASA identifies ionization sites in 97.8% and 83.4% of complex small-molecule acids and bases, respectively, which is useful for structure preparation in protein-ligand docking. The authors also use ASA to understand the physicochemical liabilities of a recently published KRAS-degrading PROTAC (proteolysis-targeting chimera, a bifunctional small molecule that induces targeted protein degradation) and to guide its optimization.

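The recasting of regression as multitask classification (the first point above) can be illustrated with a minimal sketch. The text here does not give the paper's actual thresholds or label scheme, so everything in this example is an assumption made for illustration: one binary task per integer pH value from 1 to 13, a label of 1 when the pKₐ lies at or below that value, and the hypothetical helper names `encode_pka`, `decode_distribution`, and `most_likely_bin`.

```python
from typing import List

# ASSUMPTION: one binary task per integer pH value from 1 to 13.
# The paper's real thresholds are not stated in this summary.
THRESHOLDS: List[float] = [float(t) for t in range(1, 14)]


def encode_pka(pka: float, thresholds: List[float] = THRESHOLDS) -> List[int]:
    """Label for task i is 1 if the pKa is at or below threshold i, else 0."""
    return [1 if pka <= t else 0 for t in thresholds]


def decode_distribution(cum_probs: List[float]) -> List[float]:
    """Convert per-task probabilities P(pKa <= t_i) into probability mass per bin.

    Bins are (-inf, t_0], (t_0, t_1], ..., (t_last, +inf), so the result has
    one more entry than the input. Probabilities are clipped to [0, 1] and
    forced to be non-decreasing so that no bin receives negative mass.
    """
    monotone: List[float] = []
    running = 0.0
    for p in cum_probs:
        running = max(running, min(max(p, 0.0), 1.0))
        monotone.append(running)
    edges = [0.0] + monotone + [1.0]
    return [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]


def most_likely_bin(bin_masses: List[float]) -> int:
    """Index of the bin that carries the most probability mass."""
    return max(range(len(bin_masses)), key=bin_masses.__getitem__)


if __name__ == "__main__":
    labels = encode_pka(4.5)  # training targets for a molecule with pKa 4.5
    masses = decode_distribution([float(x) for x in labels])
    print(labels)
    print(most_likely_bin(masses))
```

Under these assumptions, the spread of mass across bins serves as a simple stand-in for the discrete uncertainty distribution the abstract describes; a trained classifier would supply the per-task probabilities in place of the hard labels used in the usage lines.
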
In summary, the paper addresses the limited accuracy, generalization, and interpretability of existing pKₐ prediction models by changing the model architecture and introducing a new interpretation method, with the goal of providing more useful tools for drug design.
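The perturbation idea behind atomic sensitivity analysis, described above, can be sketched in a few lines. This is not the authors' implementation: the text here does not specify how atoms are perturbed or how contributions are normalized. The sketch assumes a toy molecule representation (a list of element symbols), a hypothetical perturbation that deletes one atom, and a stand-in `toy_predictor` that is not a real pKₐ model. Only the overall pattern, comparing the unperturbed prediction with the prediction after perturbing each atom and never retraining, follows the description.

```python
from typing import Callable, Dict, List, Sequence

Molecule = Sequence[str]  # ASSUMPTION: toy representation, a list of element symbols


def toy_predictor(mol: Molecule) -> float:
    """Stand-in for a trained model. NOT a real pKa predictor."""
    weights: Dict[str, float] = {"N": 2.0, "O": -1.5, "C": 0.1}
    return 7.0 + sum(weights.get(atom, 0.0) for atom in mol)


def delete_atom(mol: Molecule, index: int) -> List[str]:
    """Hypothetical perturbation: drop the atom at the given index."""
    return [atom for j, atom in enumerate(mol) if j != index]


def atomic_sensitivities(
    mol: Molecule,
    predict: Callable[[Molecule], float],
    perturb: Callable[[Molecule, int], Molecule],
) -> List[float]:
    """Per-atom change in prediction: baseline minus perturbed prediction.

    The model is only queried, never retrained.
    """
    baseline = predict(mol)
    return [baseline - predict(perturb(mol, i)) for i in range(len(mol))]


def most_sensitive_atom(sensitivities: List[float]) -> int:
    """Index of the atom whose perturbation changes the prediction the most."""
    return max(range(len(sensitivities)), key=lambda i: abs(sensitivities[i]))


if __name__ == "__main__":
    molecule = ["C", "C", "N", "O"]
    scores = atomic_sensitivities(molecule, toy_predictor, delete_atom)
    for atom, score in zip(molecule, scores):
        print(atom, round(score, 3))
    print("most sensitive atom index:", most_sensitive_atom(scores))
```

In a setting like the docking application mentioned in the abstract, the atom with the largest absolute score would be the candidate ionization site; with a real model, `predict` and `perturb` would be replaced by the trained predictor and a chemically meaningful perturbation.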