014 结合位点的检测
🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先,请点击位于界面顶部的 开始连接 按钮,然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置,稍等片刻即可开始运行。
📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd,由杨合 📨 修改搬运至 Bohrium Notebook。
Aim of this talktorial
The binding site of a protein is the key to its function. In this talktorial, we introduce the concepts of computational binding site detection tools using DoGSiteScorer from the protein.plus web server, exemplified on an EGFR structure. Additionally, we compare the results to the pre-defined KLIFS binding site by calculating the percentage of residues in accordance between the two sets.
Contents in Theory
- Protein binding sites
- Binding site detection
- Methods overview
- DoGSiteScorer
- Comparison to KLIFS pocket
Contents in Practical
- Binding site detection using DoGSiteScorer
- Job submission of structure of interest
- Get DoGSiteScorer pocket metadata
- Pick the most suitable pocket
- Get best binding site file content
- Investigate detected pocket
- Comparison between DoGSiteScorer and KLIFS pocket
- Get DoGSiteScorer pocket residues
- Get KLIFS pocket residues
- Overlap of pocket residues
References
- Prediction, Analysis, and Comparison of Active Sites Volkamer et al., (2018), book chapter in Applied Chemoinformatics: Achievements and Future Opportunities, Wiley
- DoGSiteScorer, Volkamer et al., J.Chem.Inf.Model, (2012), 52(2):360-372
- ProteinsPlus: a web portal for structure analysis of macromolecules. Fährrolfes et al., NAR, (2017), 3;45(W1)
- KLIFS: a structural kinase-ligand interaction database, Kanev et al., NAR, (2021), 49(D1):D562-D569
Theory
Protein binding sites
Most biological processes are guided through (non-)reversible binding of molecules. Given a therapeutic target associated to a specific disease, knowing its binding site(s), i.e. the key to the proteins function, is of utmost importance for designing new drugs.
Depending on the given data, e.g., no protein-ligand complex structure (x-ray) is available or one is interested in allosteric sites, binding site detection algorithms come into play. Binding sites, or in the case of enzymes rather called active sites, are cavities in 3-dimensional space, mostly on the surface of a protein structure, that serve as binding (docking) regions for ligands, peptides, or proteins. To interact with each other, the two binding partners need to be complementary concerning shape and physico-chemical properties (key and look principle).
Figure 1: Example of a detected binding site for EGFR kinase (PDB: 3w32) using DoGSiteScorer from proteins.plus. Protein shown as blue cartoon, ligand as sticks (carbons in gray) and binding site as violet cloud (largest subpocket SP_0_0 shown).
Binding site detection
Methods overview
If ligand information is available, i.e., a protein-ligand complex, the protein residues surrounding the ligand can simply be defined as pocket (e.g. using all protein residues within a predefined radius of the ligand atoms such as 6 Å). If the ligand is absent, prediction tools can be used for in silico pocket detection. These methods can be grouped on the one hand in geometry- and energy-based methods as well as on the other hand in grid-based and grid-free approaches, as outlined in Figure 2. Note that in recent years, more and more machine or deep learning based methods have been developed (see e.g. DeepSite by Jiménez et al., Bioinformatics, 2017, 33(19), 3036–3042).
Figure 2: Binding site detection methods can be grouped into geometry-based and energy-based approaches as well as grid-based and grid-free approaches. Figure taken from Prediction, Analysis, and Comparison of Active Sites, Volkamer et al., (2018), Applied Chemoinformatics: Achievements and Future Opportunities, Wiley.
Geometry-based approaches analyze the shape of a molecular surface to locate cavities and incorporate the 3D spatial arrangement of the atoms on the protein surface. Energy-based approaches record interactions of probes or molecular fragments with the protein, thus, favorable energetic responses are assigned to pockets. Both strategies can be performed on a Cartesian grid-based representation of the protein (i.e. checking the environment per grid point) or without (i.e. grid-free).
In the following, an example for each of the four categories will be shortly introduced:
- Geometric, grid-based approach: In LIGSITE (Hendlich, et al., J Mol Graph Model., 1997, 15(6):359-63, 389), a Cartesian grid (e.g. 1Å grid spacing) is spanned over the protein of interest. Each grid point is then scanned in seven direction (along the x, y and z axes as well as the four cubic diagonals) and the number of Protein-Solvent-Protein (PSP) events per point is stored (# rays restricted on both side by the protein). Finally, grid points that are buried (= have a high PSP value) are clustered to pockets.
- Geometric, grid-free approach: In SURFNET (Laskowski, J Mol Graph., 1995, 13(5):323-30, 307-8), spheres are placed midway between all pairs of atoms on the protein surface directly. In case a probe clashes with any nearby atom, its radius is reduced until no overlap occurs. The resulting probes define the cavities.
- Energy, grid-based approach: In DrugSite (An, et al., Genome Informatics, 2004, 15(2): 31–41), the protein is embedded in a Cartesian grid and carbon probes are placed on each grid point. Then, van der Waals energies between the probe and the protein environment within 8 Å distance are calculated. Grid points with unfavorable energies, i.e., above an energy cut-off based on the mean energy and standard deviation over the whole grid, are discarded. Finally, grid points fulfilling this cut-off are merged to pockets.
- Energy, grid-free approach: In docking-based methods, fragments (or small molecules) are docked against the protein of interest (placed and scored, for more info on docking see talktorial T015). Pockets are then assigned based on the quantity of fragments that bind to a specific area.
DoGSiteScorer
In this talktorial, we will use the DoGSiteScorer functionality, available within protein.plus, to detect and score the pockets of a protein of interest. Thus, the algorithm will be explained in a bit more detail (see Figure 3 for a visual explanation).
Pocket detection: DoGSiteScorer incorporates a geometric and grid-based algorithm to detect pockets. The protein is embedded in a Cartesian grid, and each grid point is labeled as either 0 (free) or 1 (occupied), depending on if it lies within any protein atom's van der Waals radius. Then, an edge-detection algorithm from image processing, a Difference of Gaussian filter (thus the name DoGSite) is invoked to identify protrusion on the protein surface (i.e. the positions on a protein surface where the location of a sphere-like object is favorable). Based on specific cut-off criteria, grid points with the highest intensity are selected and first clustered to subpockets, then merged to pockets.
Descriptor calculation: Based on the grid representation of the respective pocket as well as the surrounding protein atoms, properties describing the pocket are derived. These include properties such as volume, surface, or depth of the pocket (calculated directly from the properties of the individual grid points) as well as hyprophobicity, number of available hydrogen bond donors/acceptors or amino acid count (derived from the neighboring protein residues).
Druggability estimates: Additionally, the tool has an in-built druggability predictor. Druggability can be defined as the ability of a (disease-associated) target to bind - and potentially be modulated by - low molecular weight compounds (sometimes also referred to as ligandability). In DoGSiteScorer, druggablity is predicted using a support vector machine (SVM) model, trained and tested on the freely available (non-redundant) druggability data set (NR) DD. The DD consists of 1069 targets and each target was assigned to one of the three classes: druggable, difficult, or undruggable.
For more detail on DoGSiteScorer's druggability model, see Volkamer et al., J. Chem. Inf. Model., 2012, 52, 2, 360–372. For more info on the druggability concept itself refer to e.g. Hopkins and Groom, The druggable genome. Nat Rev Drug Discov 1, 727–730 (2002).
Figure 3: Schematic representation of the individual steps within DoGSiteScorer: A. Pocket detection, B. descriptor calculation and C. druggability estimation. Figure newly composed based on Volkamer et al., J. Chem. Inf. Model., 2012, 52(2), 360–372 and Volkamer et al., J. Chem. Inf. Model., 2010, 50(11), 2041–2052.
Comparison to KLIFS pocket
Once we obtain the binding site of interest from DoGSiteScorer, we can compare the results with any other method in order to validate it. Here, we compare it with the KLIFS binding pocket definition for our target kinase structure using the KLIFS API (see Talktorial T012 for more detail).
KLIFS pocket definition in a nutshell: The KLIFS (Kinase-Ligand Interaction Fingerprints and Structures) database is a structural repository of information on over 3600 human and mouse kinase structures. The curated KLIFS data allows systematic analyses of all kinase structures and binding sites, bound ligands and protein-ligand interactions. KLIFS comes with a nomenclature of typical structural motifs within kinases (such as DFG-in/out, hinge region, ...) and maps the binding site of all known kinases to 85 residues, defined via an elaborated multiple sequence alignment. It is possible to compare the interaction patterns of kinase-inhibitors to each other to, for example, identify crucial interactions determining kinase-inhibitor selectivity.
In this talktorial, we will query the KLIFS API to return the binding pocket of a specific protein kinase structure of interest for further analysis.
Practical
In this practical part, we will introduce how to query the proteins.plus server for binding site detection using DoGSiteScorer for our protein of interest.
Import all necessary libraries.
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple Collecting https://github.com/volkamerlab/opencadd/archive/master.tar.gz Using cached https://github.com/volkamerlab/opencadd/archive/master.tar.gz Preparing metadata (setup.py) ... done WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Add globals to this talktorial's path (HERE
) and its data folder (DATA
).
Binding site detection using DoGSiteScorer
We first define a function to query the server for a protein of interest. Infos on the REST API can be found here.
Job submission for structure of interest
Define the protein of interest. As an example, we chose an EGFR kinase structure, more details can be found in the respective PDB entry: 3w32.
Submit the query and identify the job location (URL) on the web server for further investigation.
Get DoGSiteScorer pocket metadata
We now define a function that collects all data returned by the server and stores them in a pandas DataFrame
.
All detected pockets from DoGSiteScorer web server are retrieved and a dataframe is returned containing all pocket descriptor information.
Querying for job at URL https://proteins.plus/api/dogsite_rest/twNQ4AjQiuzoUUMsSdA3NxH4...
lig_cov | poc_cov | lig_name | volume | enclosure | surface | depth | surf/vol | lid/hull | ellVol | ell c/a | ell b/a | siteAtms | accept | donor | hydrophobic_interactions | hydrophobicity | metal | Cs | Ns | Os | Ss | Xs | negAA | posAA | polarAA | apolarAA | ALA | ARG | ASN | ASP | CYS | GLN | GLU | GLY | HIS | ILE | LEU | LYS | MET | PHE | PRO | SER | THR | TRP | TYR | VAL | simpleScore | drugScore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||||||||||||||||||||||||||||||
P_0 | 85.48 | 31.22 | W32_A_1101 | 1422.66 | 0.10 | 1673.75 | 19.26 | 1.176493 | - | - | 0.13 | 0.67 | 288 | 86 | 40 | 71 | 0.36 | 0 | 198 | 45 | 41 | 4 | 0 | 0.10 | 0.13 | 0.24 | 0.53 | 4 | 5 | 2 | 5 | 2 | 2 | 1 | 5 | 0 | 3 | 12 | 3 | 2 | 3 | 3 | 1 | 2 | 1 | 1 | 5 | 0.63 | 0.810023 |
P_0_0 | 85.48 | 73.90 | W32_A_1101 | 599.23 | 0.06 | 540.06 | 17.51 | 0.901257 | - | - | 0.14 | 0.22 | 131 | 35 | 13 | 25 | 0.34 | 0 | 95 | 16 | 17 | 3 | 0 | 0.03 | 0.10 | 0.28 | 0.59 | 1 | 2 | 1 | 1 | 2 | 1 | 0 | 2 | 0 | 2 | 7 | 1 | 2 | 2 | 1 | 0 | 2 | 0 | 0 | 2 | 0.59 | 0.620201 |
P_0_1 | 3.23 | 0.44 | W32_A_1101 | 201.73 | 0.08 | 381.07 | 11.36 | 1.889010 | - | - | 0.17 | 0.25 | 51 | 17 | 9 | 10 | 0.28 | 0 | 36 | 6 | 7 | 2 | 0 | 0.08 | 0.17 | 0.25 | 0.50 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.17 | 0.174816 |
P_0_2 | 0.00 | 0.00 | W32_A_1101 | 185.60 | 0.17 | 282.00 | 9.35 | 1.519397 | - | - | 0.45 | 0.55 | 48 | 17 | 8 | 12 | 0.32 | 0 | 31 | 8 | 8 | 1 | 0 | 0.17 | 0.25 | 0.08 | 0.50 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0.13 | 0.195695 |
P_0_3 | 6.45 | 0.29 | W32_A_1101 | 175.30 | 0.15 | 297.42 | 9.29 | 1.696634 | - | - | 0.23 | 0.37 | 48 | 16 | 8 | 14 | 0.37 | 0 | 32 | 8 | 8 | 0 | 0 | 0.14 | 0.14 | 0.36 | 0.36 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0.13 | 0.168845 |
Note: We selected to have pockets and subpockets returned from the algorithm, e.g. pocket P_0
contains subpockets P_0_0
to P_0_5
. Depending on your research question you might prefer to work with the pockets - mostly if you are more interested in the general location (e.g. for docking) - or the subpockets, which give a more granular description.
Pick the most suitable pocket
The pockets are initially sorted by the volume of the respective pocket.
We want to have a deeper look at the pocket that contains the ligand (see ligand coverage column lig_cov
) and which is covered by the ligand the most (pocket coverage poc_cov
).
For clarity, we drop a few columns first from the dataframe.
Sort the obtained binding site by your column of interest.
lig_cov | poc_cov | lig_name | volume | enclosure | surface | depth | surf/vol | accept | donor | hydrophobic_interactions | hydrophobicity | metal | simpleScore | drugScore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||
P_0_0 | 85.48 | 73.90 | W32_A_1101 | 599.23 | 0.06 | 540.06 | 17.51 | 0.901257 | 35 | 13 | 25 | 0.34 | 0 | 0.59 | 0.620201 |
P_0 | 85.48 | 31.22 | W32_A_1101 | 1422.66 | 0.10 | 1673.75 | 19.26 | 1.176493 | 86 | 40 | 71 | 0.36 | 0 | 0.63 | 0.810023 |
P_0_3 | 6.45 | 0.29 | W32_A_1101 | 175.30 | 0.15 | 297.42 | 9.29 | 1.696634 | 16 | 8 | 14 | 0.37 | 0 | 0.13 | 0.168845 |
P_0_1 | 3.23 | 0.44 | W32_A_1101 | 201.73 | 0.08 | 381.07 | 11.36 | 1.889010 | 17 | 9 | 10 | 0.28 | 0 | 0.17 | 0.174816 |
P_0_2 | 0.00 | 0.00 | W32_A_1101 | 185.60 | 0.17 | 282.00 | 9.35 | 1.519397 | 17 | 8 | 12 | 0.32 | 0 | 0.13 | 0.195695 |
Note: We decided here to select the pocket with the best ligand and pocket coverage to have a precise description of the pocket for our comparison later. In other drug design scenarios, you might want to simply sort by drugScore
(metadata.sort_values(by='drugScore', ascending=False)
) or other values, such as volume
or simpleScore
.
We define a function that returns the ID (variable is called best_pocket_id
) of the pocket that fullfills our sorting criterion.
We now programatically access the name of the best pocket based on the selected sorting.
'P_0_0'
Get binding site file content
We define two helper functions to first get the URL for all pocket pdb or ccp4 files, respectively. Then, we only process the files further for our pocket of interest.
Note that DoGSiteScorer returns two structural (3D) file formats per pocket:
- pdb: These files contain all atoms/residues surrounding the pocket.
- Accessed via
residues
- Naming scheme, e.g. 3w32_P_0_0
_res.pdb
- Accessed via
- ccp4: These files contain the grid points composing the pocket, also considered the volume of the pocket.
- Accessed via
pockets
- Naming scheme, e.g. 3w32_P_0_0
_gpsAll.ccp4.gz
- Accessed via
Get URL for PDB file containing the selected pocket's residues.
Investigate detected pocket
Pocket descriptors
The algorithm returns information on all calculated descriptors for the respective pocket.
lig_cov 85.48 poc_cov 73.9 lig_name W32_A_1101 volume 599.23 enclosure 0.06 surface 540.06 depth 17.51 surf/vol 0.901257 accept 35 donor 13 hydrophobic_interactions 25 hydrophobicity 0.34 metal 0 simpleScore 0.59 drugScore 0.620201 Name: P_0_0, dtype: object
Visualize the pocket
We need to get the respective ccp4
file that contains the pocket information. First, we get the URL for theccp4
file containing the selected pocket's volume.
We get the grid point coordinates from that proteins.plus URL. We do not save this to disk, but to a memory file (io.BytesIO
). Thanks to the extension in the URL, we know this is a gzipped (.gz
extension) CCP4 map (.ccp4
). What we do:
- Download the file from the server into memory. At this point we have gzipped bytes.
- We decompress the response in memory using
gzip.decompress
. - We wrap it in a
io.BytesIO
object that will behave as an opened file.
Note: If the pocket volume does not show in the NGLviewer, please try to run the notebook with Jupyter Notebook (instead of Jupyter Lab).
The above rendered image shows the same content as when using the proteins.plus web server directly (see Figure 1).
Comparison between DoGSiteScorer and KLIFS pocket
Get DoGSiteScorer pocket residues
Since we want to compare the automatically predicted pocket with the manually assigned KLIFS pocket, we need to identify the pocket residues.
We get the residues of selected pocket and show the top 5 residues.
residue_number | residue_name | |
---|---|---|
0 | 718 | LEU |
4 | 726 | VAL |
11 | 743 | ALA |
12 | 744 | ILE |
20 | 745 | LYS |
Another pocket visualization
We can also use the selected pocket residues to investigate our pocket in 3D.
Get KLIFS pocket residues
We are using the opencadd.databases.klifs
module to extract the binding site residues (PDB residue numbering) as defined in the KLIFS database.
Please refer to Talktorial T012 for detailed information about KLIFS and the opencadd.databases.klifs
module usage.
residue.klifs_id | residue.id | residue.klifs_region_id | residue.klifs_region | residue.klifs_color | |
---|---|---|---|---|---|
0 | 1 | 716 | I.1 | I | khaki |
1 | 2 | 717 | I.2 | I | khaki |
2 | 3 | 718 | I.3 | I | khaki |
3 | 4 | 719 | g.l.4 | g.l | green |
4 | 5 | 720 | g.l.5 | g.l | green |
Overlap of pocket residues
DoGSiteScorer pocket P_0_0 with 29 residues detected. KLIFS pocket with 85 residues detected.
Overlap between the two residue sets: 27.
Plot the overlap as Venn diagram.
Residue overlap between the two methods using P_0_0.
As we can see, the subpocket we choose from DoGSiteScorer is very precise. In fact, 27 out of the 29 residues do overlap with the KLIFS definition (85 residues).
Though, the KLIFS pocket covers a larger overall pocket area.
Out of interest, we can also do the same comparison using the pocket P_0
(instead of subpocket P_0_0
) to see how that compares to KLIFS.
DoGSiteScorer pocket P_0 with 62 detected.
Residue overlap between the two methods using P_0.
Discussion
We introduced the idea of binding site detection algorithms using DoGSiteScorer, which can be queried through the proteins.plus server. Besides investigating into the main (known) binding site, as exemplified above, one can use other predicted pockets to initiate drug design campaigns against a novel, e.g. allosteric or less explored, pocket.
Potential shortcomings: Such algorithms detect the pockets based on a given protein structure (x-ray). If another structure of the same protein is available, applying the algorithm to that structure might lead to (slightly) different pockets due to e.g. conformational changes induced by the bound ligand. Also differences between apo and holo structures do affect the predictions.
Reference
https://github.com/volkamerlab/teachopencadd
Reprint statement
Original title: Binding site detection
Authors:
- Adapted from Abishek Laxmanan Ravi Shankar, 2019, internship at Volkamer lab
- Andrea Volkamer, 2020/21, Volkamer lab, Charité
- Dominique Sydow, 2020/21, Volkamer lab, Charité