AlphaFold meets de novo drug design: leveraging structural protein information in multi-target molecular generative models

Andrius Bernatavicius, Martin Šícho, Antonius Janssen, Alan Kai Hassen, Mike Preuss, Gerard van Westen
DOI: https://doi.org/10.26434/chemrxiv-2024-60tc7-v2
2024-04-18
Abstract: Advances in deep learning have expanded the applications of virtual screening for drug-like compounds. More recently, generative models have emerged as sources of inspiration for chemists. We introduce a multi-target model, PCMol, that leverages the latent embeddings derived from AlphaFold as a means of conditioning the de novo generative model on target proteins. It is known that the addition of protein descriptors is an effective strategy to extend the applicability domain and prediction capability of quantitative structure-activity relationship (QSAR) models, a strategy we refer to as proteochemometrics (PCM). Similarly, the use of AlphaFold latent embeddings within a generative model for small molecules allows it to leverage structural relationships between proteins. This opens up new possibilities such as interpolation within the chemical space of known highly active compounds and extrapolation on the target side based on their similarities to other proteins, which is especially relevant for understudied or novel targets. Our results indicate that PCMol can generate diverse, potentially active molecules for a wide array of proteins, including those with sparse ligand bioactivity data. We also benchmark against existing target-conditioned transformer models to illustrate the validity of using AlphaFold protein representations to steer the molecular generation process and increase the generalization capabilities to unseen targets. Additionally, we demonstrate the important role of data augmentation in bolstering the performance of generative models in low-data regimes. The open-source package along with a dataset of AlphaFold protein embeddings is available at https://github.com/CDDLeiden/PCMol.
Chemistry
What problem does this paper attempt to address?
The paper aims to improve multi-target molecular generative models for de novo drug design by using structural protein information derived from AlphaFold. It introduces PCMol, a model that takes latent protein embeddings extracted from AlphaFold as conditional input to steer the generation of new molecules toward specific target proteins. The approach carries over to generative models the proteochemometrics (PCM) strategy, in which protein descriptors or embeddings are added to quantitative structure-activity relationship (QSAR) models to extend their applicability domain and predictive capability.

The key points of the paper include:

1. **Improving the generalization ability of the multi-target molecule generation model**: By using AlphaFold protein representations, the model can generate diverse and potentially active molecules for unseen targets, which is especially important for targets with sparse bioactivity data.
2. **Enhancing performance in low-data regimes**: Data augmentation techniques, such as SMILES enumeration, increase the diversity of the training set and thereby improve model performance when little data is available.
3. **Verifying the effectiveness of the model**: Benchmarking against existing target-conditioned transformer models supports the use of AlphaFold protein representations to guide molecule generation and demonstrates the model's generative ability across multiple proteins.

In conclusion, this research aims to develop a more powerful and more general-purpose multi-target molecule generation model that can be applied effectively to de novo drug design, especially for novel or understudied targets.
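
The idea that embeddings let a model relate an unseen target to better-studied ones can be illustrated with a small, hedged sketch. It is not PCMol's code: PCMol conditions its generator on the embeddings directly, whereas this example only shows the underlying intuition by ranking known targets by cosine similarity to a new one. The target names, the 4-dimensional toy vectors, and the helper `rank_known_targets` are all hypothetical; real AlphaFold-derived embeddings are far larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_known_targets(query, known):
    """Return (name, similarity) pairs sorted from most to least similar.

    Hypothetical helper for illustration; not part of the PCMol package.
    """
    scores = [(name, cosine_similarity(query, emb)) for name, emb in known.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for pooled protein embeddings (assumed values).
known_targets = {
    "kinase_A": [0.9, 0.1, 0.0, 0.2],
    "kinase_B": [0.8, 0.2, 0.1, 0.1],
    "gpcr_C": [0.0, 0.9, 0.7, 0.1],
}
unseen_target = [0.85, 0.15, 0.05, 0.15]

ranking = rank_known_targets(unseen_target, known_targets)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two kinase-like vectors rank well above the GPCR-like one, which mirrors the kind of structural relationship a target-conditioned generator could exploit when ligand data for the new target is sparse.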