009 基于配体的药效团模型
🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先,请点击位于界面顶部的 开始连接 按钮,然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置,稍等片刻即可开始运行。
📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd,由杨合 📨 修改搬运至 Bohrium Notebook。
Talktorial T009: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising of talktorials T001-T010.
Note: Please run this notebook cell by cell. Running all cells in one is possible also, however, part of the nglview 3D representations might be missing.
Aim of this talktorial
In this talktorial, we use known EGFR ligands, which were selected and aligned in the previous talktorial, to identify donor, acceptor, and hydrophobic pharmacophoric features for each ligand. Those features are then clustered to define an ensemble pharmacophore, which represents the properties of the set of known EGFR ligands and can be used to search for novel EGFR ligands via virtual screening.
Learning goals
Contents in Theory
- Pharmacophore modeling
- Structure- and ligand-based pharmacophore modeling
- Virtual screening with pharmacophores
- Clustering: k-means
Contents in Practical
- Get pre-aligned ligands from previous talktorial
- Show ligands with NGLView
- Extract pharmacophore features
- Show the pharmacophore features of all ligands
- Hydrogen bond donors
- Hydrogen bond acceptors
- Hydrophobic contacts
- Collect coordinates of features per feature type
- Generate ensemble pharmacophores
- Set static parameters for k-means clustering
- Set static parameters for cluster selection
- Define k-means clustering and cluster selection functions
- Cluster features
- Select relevant clusters
- Get selected cluster coordinates
- Show clusters
- Hydrogen bond donors
- Hydrogen bond acceptors
- Hydrophobic contacts
- Show ensemble pharmacophore
References
- IUPAC pharmacophore definition (Pure & Appl. Chem (1998), 70, 1129-43)
- 3D pharmacophores in LigandScout (J. Chem. Inf. Model. (2005), 45, 160-9)
- Book chapter: Pharmacophore Perception and Applications (Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82)
- Book chapter: Structure-Based Virtual Screening (Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).
- Monty Kier and the origin of the pharmacophore concept (Internet Electron. J. Mol. Des. (2007), 6, 271-9)
- Nik Stiefl's demonstration of pharmacophore modeling with RDKit (RDKit UGM 2016 on GitHub)
Theory
Pharmacophores
In computer-aided drug design, the description of drug-target interactions with pharmacophores is a well-established method. The term pharmacophore was defined in 1998 by a IUPAC working party: "A pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response." (Pure & Appl. Chem. (1998), 70, 1129-43)
In other words, a pharmacophore consists of several pharmacophoric features, which describe important steric and physico-chemical properties of a ligand observed to bind a target under investigation. Such physico-chemical properties (also called feature types) can be hydrogen bond donors/acceptors, hydrophobic/aromatic interactions, or positively/negatively charged groups, and the steric properties are defined by the 3D arrangement of these features.
Structure- and ligand-based pharmacophore modeling
In pharmacophore modeling, two main approaches are used, depending on the biological question and available data sources, i.e. structure- and ligand-based pharmacophore modeling.
Structure-based pharmacophore models are derived from protein-ligand complexes. Features are defined by observed interactions between the protein and ligand, ensuring that only those ligand moieties are used for virtual screening that have already been shown to be involved in ligand binding. However, structures of protein-ligand complexes are not available for all targets. In this case, either complex structures can be generated by modeling the ligand into the target binding site, e.g. via molecular docking, or pharmacophore modeling methods can be invoked that only use the target binding site to detect potential protein-ligand interaction sites.
Ligand-based pharmacophore models are based on a set of ligands known to bind the target under investigation. The common chemical features of these ligands build the pharmacophore model. This method is used for targets with multiple known ligands and in case of missing protein-ligand complex structures. In this talktorial, we will use ligand-based pharmacophore modeling using a set of known EGFR ligands.
For more information on pharmacophore modeling, we recommend Pharmacophore Perception and Applications: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82 and J. Chem. Inf. Model. (2005), 45, 160-9.
Figure 1: Structure-based pharmacophore representing protein-ligand interactions. Figure created by Dominique Sydow.
Virtual screening with pharmacophores
As described earlier in Talktorial 4, virtual screening (VS) describes the screening of a query, e.g. a pharmacophore model (T009) or a query compound (T004), against a large library of compounds, in order to identify the small molecules in the library that are most likely to bind a target under investigation (represented by the query). In pharmacophore-based virtual screening, the compound library is matched compound-by-compound into a pharmacophore model and ranked by the best matching results (Structure-Based Virtual Screening: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).
Clustering: k-means
In this talktorial we will generate an ensemble pharmacophore by clustering the feature points of several ligand-based pharmacophores. The clustering algorithm uses the k-means clustering, which aims to cluster a data set into k clusters:
- k different centroids are selected and each point of the data set is assigned to its closest centroids.
- New centroids are calculated based on the current clusters and each point of the data set is newly assigned to its closest centroids.
- This procedure is repeated until the centroids are stable.
More info about k-means in Wikipedia.
Practical
Set the Path to the current notebook.
Get pre-aligned ligands from previous talktorial
We retrieve all ligands that were aligned in the previous talktorial.
First, we get the file paths to all ligand PDB files.
['5HG8_lig.pdb', '5UG8_lig.pdb', '5UG9_lig.pdb', '5UGC_lig.pdb']
['5HG8', '5UG8', '5UG9', '5UGC']
Second, we read all ligands from these PDB files using RDKit.
CCC(O)NC1CCCC(OC2NC(NC3CNN(C)C3)NC3NCCC32)C1 CCC(O)NC1CN(C2NC(NC3CNN(C)C3)C3NCN(C(C)C)C3N2)CC1F CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C(C)C)C3N2)CC1F CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C)C3N2)CC1F Number of molecules: 4
Upon inspection of the structures we notice, that we encountered a problem here: When loading ligands from a PDB file, RDKit does not assign bonds in structures like aromatic rings correctly. We use the RDKit function AssignBondOrdersFromTemplate
, which assigns bonds to a molecule based on a reference molecule, e.g. in our case based on the SMILES pattern of the molecule.
Check for further information: RDKit discussion on Aromaticity of non-protein molecules in PDB not detected and RDKit documentation on AssignBondOrdersFromTemplate
.
[00:05:46] WARNING: More than one matching pattern found - picking one [00:05:46] WARNING: More than one matching pattern found - picking one
Now the correct bonds are included. We can also have a look at the molecules in 2D. In order to keep the original coordinates, we make a copy for the 2D representation.
Visualize with nglview
In the next step we visualize the ligand molecules with nglview using the following function. The molecules will be drawn on top of each other in a single view. This way we are able to visually compare strucural differences between the ligands.
Extract pharmacophore features
As described above, the aim of this talktorial is to generate a ligand-based ensemble pharmacophore from a set of ligands. First, we need to extract pharmacophore features per ligand. Therefore, we load a feature factory. We use the default feature definitions.
See also the RDKit documentation on chemical features and pharmacophores.
We take a look at the pharmacophore features that are implemented in RDKit:
['Donor.SingleAtomDonor', 'Acceptor.SingleAtomAcceptor', 'NegIonizable.AcidicGroup', 'PosIonizable.BasicGroup', 'PosIonizable.PosN', 'PosIonizable.Imidazole', 'PosIonizable.Guanidine', 'ZnBinder.ZnBinder1', 'ZnBinder.ZnBinder2', 'ZnBinder.ZnBinder3', 'ZnBinder.ZnBinder4', 'ZnBinder.ZnBinder5', 'ZnBinder.ZnBinder6', 'Aromatic.Arom4', 'Aromatic.Arom5', 'Aromatic.Arom6', 'Aromatic.Arom7', 'Aromatic.Arom8', 'Hydrophobe.ThreeWayAttach', 'Hydrophobe.ChainTwoWayAttach', 'LumpedHydrophobe.Nitro2', 'LumpedHydrophobe.RH6_6', 'LumpedHydrophobe.RH5_5', 'LumpedHydrophobe.RH4_4', 'LumpedHydrophobe.RH3_3', 'LumpedHydrophobe.tButyl', 'LumpedHydrophobe.iPropyl']
As an example, we extract all features from the first molecule.
Number of features found: 16
The type of a feature is called family in RDKit and can be retrieved with GetFamily()
.
'Donor'
First we get the frequency of feature types for our example molecule.
Counter({'Acceptor': 5, 'Aromatic': 4, 'Donor': 4, 'Hydrophobe': 2, 'LumpedHydrophobe': 1})
Now apply the functions shown above to all molecules in our ligand set. We calculate the feature type frequency per molecule, then we display it in a DataFrame.
Mol1 | Mol2 | Mol3 | Mol4 | |
---|---|---|---|---|
Donor | 4 | 2 | 2 | 2 |
Acceptor | 5 | 6 | 7 | 7 |
Aromatic | 4 | 3 | 3 | 3 |
Hydrophobe | 2 | 1 | 1 | 1 |
LumpedHydrophobe | 1 | 1 | 1 | 0 |
PosIonizable | 0 | 1 | 1 | 1 |
Furtheron, we concentrate in this talktorial only on the following feature types: hydrogen bond acceptors (acceptors), hydrogen bond donors (donors), and hydrophobic contacts (hydrophobics).
We retrieve the feature RDKit objects per feature type and per molecule.
Show the pharmacophore features of all ligands
Pharmacophore feature types usually are displayed in defined colors, e.g. usually hydrogen bond donors, hydrogen bond acceptors, and hydrophobic contacts are colored green, red, and yellow, respectively.
We use this function to visualize the features for the feature types under consideration.
Hydrogen bond donors
Number of donors in all ligands: 10
Hydrogen bond acceptors
Number of acceptors in all ligands: 25
Hydrophobic contacts
Number of hydrophobics in all ligands: 5
Collect coordinates of features per feature type
We want to cluster features per feature type. To do this we collect all coordinates of features and store them per feature type.
Now, we have the positions of e.g. all acceptor features:
[[-13.49, 14.732, -27.925], [-11.99, 13.164, -28.961], [-13.173, 17.112, -28.025], [-16.131, 17.359, -22.003], [-16.873, 11.631, -25.856], [-13.808, 14.802, -27.137], [-12.146, 16.255, -28.067], [-11.385, 12.957, -29.278], [-16.736, 10.854, -25.879], [-15.435, 19.394, -27.21], [-16.039, 17.251, -22.54], [-13.863, 14.66, -27.134], [-12.175, 16.118, -27.994], [-11.411, 12.856, -29.266], [-17.026, 10.809, -26.396], [-15.601, 9.901, -28.086], [-15.445, 19.229, -27.215], [-16.044, 17.165, -22.506], [-13.831, 14.581, -27.162], [-12.161, 16.037, -28.079], [-11.381, 12.733, -29.268], [-17.02, 10.739, -26.48], [-15.542, 9.814, -28.111], [-15.515, 19.1, -27.334], [-16.09, 17.203, -22.553]]
Generate ensemble pharmacophores
In order to generate ensemble pharmacophores, we use k-means clustering to cluster features per feature type.
Set static parameters for k-means clustering
kq
: With this paramter, we determine the number of clustersk
per feature type depending on the number of feature points, i.e. per feature typek
= number_of_features /kq
kq
should be selected so thatk
(feature clusters) is for all clusters at least 1 and not larger than 4-5 clusters
We set our kq
used to determine k
in k-means as follows:
Set static parameters for cluster selection
min_cluster_size
: We only want to retain clusters that potentially contain features from most molecules in our ligand ensemble. Therefore, we set this variable to 75% of the number of molecules in our ligand ensemble.top_cluster_number
: With this parameter, we select only the largest clusters.
Define k-means clustering and cluster selection functions
We define a function that calculates the centers of clusters, which are derived from k-means clustering.
We define a function that sorts the clusters by size and outputs a list of indices of the largest clusters.
Cluster features
For each feature type, we perform the k-means clustering with our defined clustering
function.
Clustering: Variable k in k-means: 2 of 10 points Clustering: Variable k in k-means: 4 of 25 points Clustering: Variable k in k-means: 2 of 5 points
Select relevant clusters
For each feature type, we select relevant clusters with our defined get_clusters
function.
Hydrogen bond donors Clusters sorted by size: [(1, 6), (0, 4)] Cluster indices of selected clusters: [1, 0]
Hydrogen bond acceptors Clusters sorted by size: [(0, 12), (2, 6), (1, 4), (3, 3)] Cluster indices of selected clusters: [0, 2, 1, 3]
Hydrophobic contacts Clusters sorted by size: [(0, 4), (1, 1)] Cluster indices of selected clusters: [0]
Get selected cluster coordinates
[[-12.567833333333333, 14.66725, -28.191333333333333], [-16.466333333333335, 10.624666666666666, -26.801333333333332], [-16.076, 17.244500000000002, -22.4005], [-15.465, 19.241, -27.253]]
Show clusters
Per feature type, we visualize cluster centers alongside with all molecules and all feature points.
Hydrogen bond donors
Number of donors in all ligands: 10
Hydrogen bond acceptor
Number of acceptors in all ligands: 25
Hydrophobic contacts
Number of hydrophobics in all ligands: 5
Show ensemble pharmacophore
In this last step, we combine the clustered pharmacophoric features (i.e. hydrogen bond donors and acceptors as well as hydrophobic contacts), to one ensemble pharmacophore, representing the pharmacophoric properties of the four selected ligands.
Discussion
In this talktorial, we used a set of pre-aligned ligands, which are known to bind EGFR, to generate an ensemble pharmacophore model. This model could now be used for virtual screening against a large library of small molecules, in order to find novel small molecules that show the observed steric and physico-chemical properties and might therefore also bind to the EGFR binding site.
Before screening, the pharmacophore models are usually further optimized, e.g. features might be omitted in order to reduce the number of features for screening based on biological knowledge (some interaction might be reported as important whereas others are not) or based on chemical expertise.
We do not cover the virtual screening in this talktorial, however refer to an excellent tutorial by Nik Stiefl, demonstrating pharmacophore modeling and virtual screening with RDKit (RDKit UGM 2016 on GitHub).
We used k-means clustering to cluster pharmacophore feature. This clustering approach has the disadvantage that the user needs to define the number of clusters beforehand, which is usually based on a visual inspection of the point distribution before clustering (or during cluster refinement) and is therefore hindering for an automated pharmacophore generation. Density-based clustering methods, also in combination with k-means clustering, can be a solution for this. For further reading we recommend the introduction to Density-Based Clustering at Domino Data Lab.
Reference
https://github.com/volkamerlab/teachopencadd
Reprint statement
Original title: Ligand-based pharmacophores
Authors:
- Pratik Dhakal, CADD seminar, 2017, Charité/FU Berlin
- Florian Gusewski, CADD seminar, 2018, Charité/FU Berlin
- Jaime Rodríguez-Guerra, Volkamer lab, Charité
- Dominique Sydow, Volkamer lab, Charité