Bohrium
robot
新建

空间站广场

论文
Notebooks
比赛
课程
Apps
我的主页
我的Notebooks
我的论文库
我的足迹

我的工作空间

任务
节点
文件
数据集
镜像
项目
数据库
公开
TeachOpenCADD | 009 基于配体的药效团
化学信息学
TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15
1
AI4SCUP-CNS-BBB(v1)

009 基于配体的药效团模型

🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先,请点击位于界面顶部的 开始连接 按钮,然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置,稍等片刻即可开始运行。

📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd,由杨合 📨 修改搬运至 Bohrium Notebook。

代码
文本

Talktorial T009: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising of talktorials T001-T010.

代码
文本

Note: Please run this notebook cell by cell. Running all cells in one is possible also, however, part of the nglview 3D representations might be missing.

代码
文本

Aim of this talktorial

In this talktorial, we use known EGFR ligands, which were selected and aligned in the previous talktorial, to identify donor, acceptor, and hydrophobic pharmacophoric features for each ligand. Those features are then clustered to define an ensemble pharmacophore, which represents the properties of the set of known EGFR ligands and can be used to search for novel EGFR ligands via virtual screening.

代码
文本

Learning goals

代码
文本

Contents in Theory

  • Pharmacophore modeling
    • Structure- and ligand-based pharmacophore modeling
  • Virtual screening with pharmacophores
  • Clustering: k-means
代码
文本

Contents in Practical

  • Get pre-aligned ligands from previous talktorial
  • Show ligands with NGLView
  • Extract pharmacophore features
  • Show the pharmacophore features of all ligands
    • Hydrogen bond donors
    • Hydrogen bond acceptors
    • Hydrophobic contacts
  • Collect coordinates of features per feature type
  • Generate ensemble pharmacophores
    • Set static parameters for k-means clustering
    • Set static parameters for cluster selection
    • Define k-means clustering and cluster selection functions
    • Cluster features
    • Select relevant clusters
    • Get selected cluster coordinates
  • Show clusters
    • Hydrogen bond donors
    • Hydrogen bond acceptors
    • Hydrophobic contacts
  • Show ensemble pharmacophore
代码
文本

References

代码
文本

Theory

代码
文本

Pharmacophores

In computer-aided drug design, the description of drug-target interactions with pharmacophores is a well-established method. The term pharmacophore was defined in 1998 by a IUPAC working party: "A pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response." (Pure & Appl. Chem. (1998), 70, 1129-43)

In other words, a pharmacophore consists of several pharmacophoric features, which describe important steric and physico-chemical properties of a ligand observed to bind a target under investigation. Such physico-chemical properties (also called feature types) can be hydrogen bond donors/acceptors, hydrophobic/aromatic interactions, or positively/negatively charged groups, and the steric properties are defined by the 3D arrangement of these features.

代码
文本

Structure- and ligand-based pharmacophore modeling

In pharmacophore modeling, two main approaches are used, depending on the biological question and available data sources, i.e. structure- and ligand-based pharmacophore modeling.

Structure-based pharmacophore models are derived from protein-ligand complexes. Features are defined by observed interactions between the protein and ligand, ensuring that only those ligand moieties are used for virtual screening that have already been shown to be involved in ligand binding. However, structures of protein-ligand complexes are not available for all targets. In this case, either complex structures can be generated by modeling the ligand into the target binding site, e.g. via molecular docking, or pharmacophore modeling methods can be invoked that only use the target binding site to detect potential protein-ligand interaction sites.

Ligand-based pharmacophore models are based on a set of ligands known to bind the target under investigation. The common chemical features of these ligands build the pharmacophore model. This method is used for targets with multiple known ligands and in case of missing protein-ligand complex structures. In this talktorial, we will use ligand-based pharmacophore modeling using a set of known EGFR ligands.

For more information on pharmacophore modeling, we recommend Pharmacophore Perception and Applications: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82 and J. Chem. Inf. Model. (2005), 45, 160-9.

image.png

Figure 1: Structure-based pharmacophore representing protein-ligand interactions. Figure created by Dominique Sydow.

代码
文本

Virtual screening with pharmacophores

As described earlier in Talktorial 4, virtual screening (VS) describes the screening of a query, e.g. a pharmacophore model (T009) or a query compound (T004), against a large library of compounds, in order to identify the small molecules in the library that are most likely to bind a target under investigation (represented by the query). In pharmacophore-based virtual screening, the compound library is matched compound-by-compound into a pharmacophore model and ranked by the best matching results (Structure-Based Virtual Screening: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).

代码
文本

Clustering: k-means

In this talktorial we will generate an ensemble pharmacophore by clustering the feature points of several ligand-based pharmacophores. The clustering algorithm uses the k-means clustering, which aims to cluster a data set into k clusters:

  1. k different centroids are selected and each point of the data set is assigned to its closest centroids.
  2. New centroids are calculated based on the current clusters and each point of the data set is newly assigned to its closest centroids.
  3. This procedure is repeated until the centroids are stable.

More info about k-means in Wikipedia.

代码
文本

Practical

代码
文本
[1]
import collections
from collections import Counter
from pathlib import Path
import operator
import time
import math

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets, cluster

from rdkit import RDConfig, Chem, Geometry, DistanceGeometry
from rdkit.Chem import (
ChemicalFeatures,
rdDistGeom,
Draw,
rdMolTransforms,
AllChem,
)
from rdkit.Chem.Draw import DrawingOptions
from rdkit.Chem.Pharm3D import Pharmacophore, EmbedLib
from rdkit.Numerics import rdAlignment
import nglview as nv
代码
文本

Set the Path to the current notebook.

代码
文本
[2]
HERE = Path(_dh[-1])
DATA = HERE / "data"
代码
文本

Get pre-aligned ligands from previous talktorial

We retrieve all ligands that were aligned in the previous talktorial.

First, we get the file paths to all ligand PDB files.

代码
文本
[3]
mol_files = sorted(Path(HERE / "../T008_query_pdb/data/").glob("*_lig.pdb"))
print([f.name for f in mol_files])
['5HG8_lig.pdb', '5UG8_lig.pdb', '5UG9_lig.pdb', '5UGC_lig.pdb']
代码
文本
[4]
# NBVAL_CHECK_OUTPUT
pdb_ids = [Path(file).stem.split("_")[0] for file in mol_files]
pdb_ids
['5HG8', '5UG8', '5UG9', '5UGC']
代码
文本

Second, we read all ligands from these PDB files using RDKit.

代码
文本
[5]
# NBVAL_CHECK_OUTPUT
molecules = []
for mol_file in mol_files:
molecule = Chem.MolFromPDBFile(str(mol_file), removeHs=False)
if molecule is None:
print(mol_file, "Could not be read")
else:
Chem.SanitizeMol(molecule)
print(Chem.MolToSmiles(molecule))
molecules.append(molecule)
print(f"Number of molecules: {len(molecules)}")
CCC(O)NC1CCCC(OC2NC(NC3CNN(C)C3)NC3NCCC32)C1
CCC(O)NC1CN(C2NC(NC3CNN(C)C3)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C)C3N2)CC1F
Number of molecules: 4
代码
文本
[6]
Draw.MolsToGridImage(
molecules,
molsPerRow=4,
legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
代码
文本

Upon inspection of the structures we notice, that we encountered a problem here: When loading ligands from a PDB file, RDKit does not assign bonds in structures like aromatic rings correctly. We use the RDKit function AssignBondOrdersFromTemplate, which assigns bonds to a molecule based on a reference molecule, e.g. in our case based on the SMILES pattern of the molecule.

Check for further information: RDKit discussion on Aromaticity of non-protein molecules in PDB not detected and RDKit documentation on AssignBondOrdersFromTemplate.

代码
文本
[7]
# Load SMILES for PDB ligand structures
ligands = pd.read_csv(HERE / "../T008_query_pdb/data/PDB_top_ligands.csv")

# Get SMILES in the same order as in pdb_ids
ligand_smiles = [
ligands[ligands["@structureId"] == pdb_id]["smiles"].values[0] for pdb_id in pdb_ids
]

# Generate RDKit Mol object from SMILES
reference_molecules = [Chem.MolFromSmiles(smiles) for smiles in ligand_smiles]

# Assign bond orders to molecules based on SMILES patterns of reference molecules
molecules = [
AllChem.AssignBondOrdersFromTemplate(reference_molecule, molecule)
for reference_molecule, molecule in zip(reference_molecules, molecules)
]
[00:05:46] WARNING: More than one matching pattern found - picking one

[00:05:46] WARNING: More than one matching pattern found - picking one

代码
文本
[8]
Draw.MolsToGridImage(
molecules,
molsPerRow=4,
legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
代码
文本

Now the correct bonds are included. We can also have a look at the molecules in 2D. In order to keep the original coordinates, we make a copy for the 2D representation.

代码
文本
[9]
molecules_2d = []
for molecule in molecules:
tmp = Chem.Mol(molecule)
AllChem.Compute2DCoords(tmp)
molecules_2d.append(tmp)
Draw.MolsToGridImage(
molecules_2d,
molsPerRow=4,
legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
代码
文本

Visualize with nglview

代码
文本

In the next step we visualize the ligand molecules with nglview using the following function. The molecules will be drawn on top of each other in a single view. This way we are able to visually compare strucural differences between the ligands.

代码
文本
[10]
def show_ligands(molecules):
"""Generate a view of the ligand molecules.

Parameters
-----------
molecules: list of rdkit.Chem.rdchem.Mol

Returns
----------
nglview.widget.NGLWidget
"""
view = nv.NGLWidget()
for molecule in molecules:
component = view.add_component(molecule)
time.sleep(0.1)
component.clear()
component.add_ball_and_stick(multipleBond=True)
return view
代码
文本
[11]
view = show_ligands(molecules)
view
代码
文本
[12]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[13]
view._display_image()
代码
文本

Extract pharmacophore features

As described above, the aim of this talktorial is to generate a ligand-based ensemble pharmacophore from a set of ligands. First, we need to extract pharmacophore features per ligand. Therefore, we load a feature factory. We use the default feature definitions.

See also the RDKit documentation on chemical features and pharmacophores.

代码
文本
[14]
feature_factory = AllChem.BuildFeatureFactory(str(Path(RDConfig.RDDataDir) / "BaseFeatures.fdef"))
代码
文本

We take a look at the pharmacophore features that are implemented in RDKit:

代码
文本
[15]
list(feature_factory.GetFeatureDefs().keys())
['Donor.SingleAtomDonor',
 'Acceptor.SingleAtomAcceptor',
 'NegIonizable.AcidicGroup',
 'PosIonizable.BasicGroup',
 'PosIonizable.PosN',
 'PosIonizable.Imidazole',
 'PosIonizable.Guanidine',
 'ZnBinder.ZnBinder1',
 'ZnBinder.ZnBinder2',
 'ZnBinder.ZnBinder3',
 'ZnBinder.ZnBinder4',
 'ZnBinder.ZnBinder5',
 'ZnBinder.ZnBinder6',
 'Aromatic.Arom4',
 'Aromatic.Arom5',
 'Aromatic.Arom6',
 'Aromatic.Arom7',
 'Aromatic.Arom8',
 'Hydrophobe.ThreeWayAttach',
 'Hydrophobe.ChainTwoWayAttach',
 'LumpedHydrophobe.Nitro2',
 'LumpedHydrophobe.RH6_6',
 'LumpedHydrophobe.RH5_5',
 'LumpedHydrophobe.RH4_4',
 'LumpedHydrophobe.RH3_3',
 'LumpedHydrophobe.tButyl',
 'LumpedHydrophobe.iPropyl']
代码
文本

As an example, we extract all features from the first molecule.

代码
文本
[16]
example_molecule = molecules[0]
example_molecule
代码
文本
[17]
# NBVAL_CHECK_OUTPUT
features = feature_factory.GetFeaturesForMol(example_molecule)
print(f"Number of features found: {len(features)}")
Number of features found: 16
代码
文本

The type of a feature is called family in RDKit and can be retrieved with GetFamily().

代码
文本
[18]
# NBVAL_CHECK_OUTPUT
features[0].GetFamily()
'Donor'
代码
文本

First we get the frequency of feature types for our example molecule.

代码
文本
[19]
# NBVAL_CHECK_OUTPUT
feature_frequency = collections.Counter(sorted([feature.GetFamily() for feature in features]))
feature_frequency
Counter({'Acceptor': 5,
         'Aromatic': 4,
         'Donor': 4,
         'Hydrophobe': 2,
         'LumpedHydrophobe': 1})
代码
文本

Now apply the functions shown above to all molecules in our ligand set. We calculate the feature type frequency per molecule, then we display it in a DataFrame.

代码
文本
[20]
molecule_feature_frequencies = []
for molecule in molecules:
features = [feature.GetFamily() for feature in feature_factory.GetFeaturesForMol(molecule)]
feature_frequency = collections.Counter(features)
molecule_feature_frequencies.append(feature_frequency)

feature_frequencies_df = (
pd.DataFrame(
molecule_feature_frequencies,
index=[f"Mol{i}" for i, _ in enumerate(molecules, 1)],
)
.fillna(0)
.astype(int)
)
feature_frequencies_df.transpose()
Mol1 Mol2 Mol3 Mol4
Donor 4 2 2 2
Acceptor 5 6 7 7
Aromatic 4 3 3 3
Hydrophobe 2 1 1 1
LumpedHydrophobe 1 1 1 0
PosIonizable 0 1 1 1
代码
文本

Furtheron, we concentrate in this talktorial only on the following feature types: hydrogen bond acceptors (acceptors), hydrogen bond donors (donors), and hydrophobic contacts (hydrophobics).

We retrieve the feature RDKit objects per feature type and per molecule.

代码
文本
[21]
acceptors = []
donors = []
hydrophobics = []

for molecule in molecules:
acceptors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Acceptor"))
donors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Donor"))
hydrophobics.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Hydrophobe"))

features = {
"donors": donors,
"acceptors": acceptors,
"hydrophobics": hydrophobics,
}
代码
文本

Show the pharmacophore features of all ligands

Pharmacophore feature types usually are displayed in defined colors, e.g. usually hydrogen bond donors, hydrogen bond acceptors, and hydrophobic contacts are colored green, red, and yellow, respectively.

代码
文本
[22]
feature_colors = {
"donors": (0, 0.9, 0), # Green
"acceptors": (0.9, 0, 0), # Red
"hydrophobics": (1, 0.9, 0), # Yellow
}
代码
文本
[23]
def visualize_features(
molecules,
features,
feature_type="features",
color="yellow",
sphere_radius=0.5,
):
"""Generate a view of the molecules highlighting the specified feature type.

Parameters
-----------
molecules: list of rdkit.Chem.rdchem.Mol
molecules to be visualized
features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature
extracted features from molecules chosen to be highlighted
feature_type: string, optional
name of the feature to be highlighted
color: string, optional
color used to display the highlighted features
sphere_radius: float, optional
display size of the highlighted features

Returns
----------
nglview.widget.NGLWidget
"""
print(f"Number of {feature_type} in all ligands: {sum([len(i) for i in features])}")
view = show_ligands(molecules)
for i, feature_set in enumerate(features, 1):
for feature in feature_set:
loc = list(feature.GetPos())
label = f"{feature_type}_{i}"
view.shape.add_sphere(loc, color, sphere_radius, label)
return view
代码
文本

We use this function to visualize the features for the feature types under consideration.

代码
文本

Hydrogen bond donors

代码
文本
[24]
feature_type = "donors"
view = visualize_features(
molecules,
features[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of donors in all ligands: 10
代码
文本
[25]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[26]
view._display_image()
代码
文本

Hydrogen bond acceptors

代码
文本
[27]
feature_type = "acceptors"
view = visualize_features(
molecules,
features[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of acceptors in all ligands: 25
代码
文本
[28]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[29]
view._display_image()
代码
文本

Hydrophobic contacts

代码
文本
[30]
feature_type = "hydrophobics"
view = visualize_features(
molecules,
features[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of hydrophobics in all ligands: 5
代码
文本
[31]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[32]
view._display_image()
代码
文本

Collect coordinates of features per feature type

We want to cluster features per feature type. To do this we collect all coordinates of features and store them per feature type.

代码
文本
[33]
features_coord = {
"donors": [list(item.GetPos()) for sublist in features["donors"] for item in sublist],
"acceptors": [list(item.GetPos()) for sublist in features["acceptors"] for item in sublist],
"hydrophobics": [
list(item.GetPos()) for sublist in features["hydrophobics"] for item in sublist
],
}
代码
文本

Now, we have the positions of e.g. all acceptor features:

代码
文本
[34]
features_coord["acceptors"]
[[-13.49, 14.732, -27.925],
 [-11.99, 13.164, -28.961],
 [-13.173, 17.112, -28.025],
 [-16.131, 17.359, -22.003],
 [-16.873, 11.631, -25.856],
 [-13.808, 14.802, -27.137],
 [-12.146, 16.255, -28.067],
 [-11.385, 12.957, -29.278],
 [-16.736, 10.854, -25.879],
 [-15.435, 19.394, -27.21],
 [-16.039, 17.251, -22.54],
 [-13.863, 14.66, -27.134],
 [-12.175, 16.118, -27.994],
 [-11.411, 12.856, -29.266],
 [-17.026, 10.809, -26.396],
 [-15.601, 9.901, -28.086],
 [-15.445, 19.229, -27.215],
 [-16.044, 17.165, -22.506],
 [-13.831, 14.581, -27.162],
 [-12.161, 16.037, -28.079],
 [-11.381, 12.733, -29.268],
 [-17.02, 10.739, -26.48],
 [-15.542, 9.814, -28.111],
 [-15.515, 19.1, -27.334],
 [-16.09, 17.203, -22.553]]
代码
文本

Generate ensemble pharmacophores

In order to generate ensemble pharmacophores, we use k-means clustering to cluster features per feature type.

代码
文本

Set static parameters for k-means clustering

  • kq: With this paramter, we determine the number of clusters k per feature type depending on the number of feature points, i.e. per feature type
  • k = number_of_features / kq
  • kq should be selected so that k (feature clusters) is for all clusters at least 1 and not larger than 4-5 clusters

We set our kq used to determine k in k-means as follows:

代码
文本
[35]
kq = 7
代码
文本

Set static parameters for cluster selection

  • min_cluster_size: We only want to retain clusters that potentially contain features from most molecules in our ligand ensemble. Therefore, we set this variable to 75% of the number of molecules in our ligand ensemble.
  • top_cluster_number: With this parameter, we select only the largest clusters.
代码
文本
[36]
min_cluster_size = int(len(molecules) * 0.75)
top_cluster_number = 4
代码
文本

Define k-means clustering and cluster selection functions

We define a function that calculates the centers of clusters, which are derived from k-means clustering.

代码
文本
[37]
def clustering(feature_coord, kq):
"""
Compute the k-means clustering of input feature coordinates.

Parameters
-----------
feature_coord: list
feature coordinates extracted from the features of a set of molecules
kq: int
Quotient used when calculating number of features k

Returns
----------
sklearn.cluster._kmeans.KMeans
"""
# Define parameter k as feature number divided by "k quotient"
k = math.ceil(len(feature_coord) / kq)
# Tailor-made adaption of k for hydrophobics in for the example in this talktorial
k = 2 if k == 1 else k
print(f"Clustering: \nVariable k in k-means: {k} of {len(feature_coord)} points\n")

# Initialize k-means and compute clustering
k_means = cluster.KMeans(n_clusters=k, n_init=10)
k_means.fit(feature_coord)

return k_means
代码
文本

We define a function that sorts the clusters by size and outputs a list of indices of the largest clusters.

代码
文本
[38]
def get_clusters(k_means, min_cluster_size, top_cluster_number):
"""
Select clusters from a k-means clustering and returns the indices of the selected clusters.
The selection is based on cluster size. The largest clusters are chosen.

Parameters
-----------
k_means: sklearn.cluster._kmeans.KMeans
k-means clustering of feature coordinates
min_cluster_size: int
minimum size needed for a cluster to be viable for selection
top_cluster_number: int
number of clusters to be selected

Returns
----------
list of int
indices of selected clusters
"""

# Sort clusters by size and only show largest
feature_labels = k_means.labels_

feature_labels_count = Counter(feature_labels)

feature_labels_count = sorted(
feature_labels_count.items(), key=operator.itemgetter(1), reverse=True
)
print(f"Clusters sorted by size: \n{feature_labels_count}\n")

# Get number of the largest clusters, which are larger then the threshold (selected clusters)
cluster_indices_sel = []

for cluster_index, cluster_size in feature_labels_count:
if cluster_size >= min_cluster_size and top_cluster_number > 0:
cluster_indices_sel.append(cluster_index)
top_cluster_number -= 1

print(f"Cluster indices of selected clusters: \n{cluster_indices_sel}\n")

return cluster_indices_sel
代码
文本

Cluster features

For each feature type, we perform the k-means clustering with our defined clustering function.

代码
文本
[39]
k_means = {
"donors": clustering(features_coord["donors"], kq),
"acceptors": clustering(features_coord["acceptors"], kq),
"hydrophobics": clustering(features_coord["hydrophobics"], kq),
}
Clustering: 
Variable k in k-means: 2 of 10 points

Clustering: 
Variable k in k-means: 4 of 25 points

Clustering: 
Variable k in k-means: 2 of 5 points

代码
文本

Select relevant clusters

For each feature type, we select relevant clusters with our defined get_clusters function.

代码
文本
[40]
print("Hydrogen bond donors\n")
cluster_indices_sel_donors = get_clusters(k_means["donors"], min_cluster_size, top_cluster_number)
Hydrogen bond donors

Clusters sorted by size: 
[(1, 6), (0, 4)]

Cluster indices of selected clusters: 
[1, 0]

代码
文本
[41]
print("Hydrogen bond acceptors\n")
cluster_indices_sel_acceptors = get_clusters(
k_means["acceptors"], min_cluster_size, top_cluster_number
)
Hydrogen bond acceptors

Clusters sorted by size: 
[(0, 12), (2, 6), (1, 4), (3, 3)]

Cluster indices of selected clusters: 
[0, 2, 1, 3]

代码
文本
[42]
print("Hydrophobic contacts\n")
cluster_indices_sel_hydrophobic = get_clusters(
k_means["hydrophobics"], min_cluster_size, top_cluster_number
)
Hydrophobic contacts

Clusters sorted by size: 
[(0, 4), (1, 1)]

Cluster indices of selected clusters: 
[0]

代码
文本
[43]
cluster_indices_sel = {
"donors": cluster_indices_sel_donors,
"acceptors": cluster_indices_sel_acceptors,
"hydrophobics": cluster_indices_sel_hydrophobic,
}
代码
文本

Get selected cluster coordinates

代码
文本
[44]
def get_selected_cluster_center_coords(k_means, cluster_indices_sel):
"""Retrieve cluster center coordinates for selected clusters.
The selected clusters are defined by their indices.

Parameters
-----------
k_means: sklearn.cluster._kmeans.KMeans
k-means clustering of feature coordinates
cluster_indices_sel: list of int
indices of the chosen clusters

Returns
----------
two-dimensional list of floats
cluster center coordinates
"""
cluster_centers = k_means.cluster_centers_

# Cast to list and then to pandas Series (for element selection by indices)
cluster_centers = pd.Series(cluster_centers.tolist())

# Select cluster centers by indices of selected clusters
cluster_centers_sel = cluster_centers[cluster_indices_sel]

return list(cluster_centers_sel)
代码
文本
[45]
cluster_centers_sel = {
"donors": get_selected_cluster_center_coords(k_means["donors"], cluster_indices_sel["donors"]),
"acceptors": get_selected_cluster_center_coords(
k_means["acceptors"], cluster_indices_sel["acceptors"]
),
"hydrophobics": get_selected_cluster_center_coords(
k_means["hydrophobics"], cluster_indices_sel["hydrophobics"]
),
}
代码
文本
[46]
cluster_centers_sel["acceptors"]
[[-12.567833333333333, 14.66725, -28.191333333333333],
 [-16.466333333333335, 10.624666666666666, -26.801333333333332],
 [-16.076, 17.244500000000002, -22.4005],
 [-15.465, 19.241, -27.253]]
代码
文本

Show clusters

Per feature type, we visualize cluster centers alongside with all molecules and all feature points.

代码
文本
[47]
def visualize_clusters(
molecules,
features,
clusters,
feature_type="features",
color="yellow",
feature_radius=0.5,
cluster_radius=1,
):
"""Visualize the cluster centers.

Parameters
-----------
molecules: list of rdkit.Chem.rdchem.Mol
molecules to be visualized
features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature
extracted features from molecules chosen to be highlighted
clusters: Two-dimensional list of floats
cluster center coordinates
feature_type: string, optional
feature type being visualized
color: string, optional
color used for rendering the features and cluster centers
feature_radius: float, optional
rendering size of the features
cluster_radius: float, optional
rendering size of the cluster centers

Returns
----------
nglview.widget.NGLWidget
"""
view = visualize_features(
molecules,
features,
feature_type,
color=color,
sphere_radius=feature_radius,
)
for i, center in enumerate(clusters, 1):
view.shape.add_sphere(list(center), color, cluster_radius, f"cluster_{feature_type}_{i}")
return view
代码
文本

Hydrogen bond donors

代码
文本
[48]
feature_type = "donors"
view = visualize_clusters(
molecules,
features[feature_type],
cluster_centers_sel[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of donors in all ligands: 10
代码
文本
[49]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[50]
view._display_image()
代码
文本

Hydrogen bond acceptor

代码
文本
[51]
feature_type = "acceptors"
view = visualize_clusters(
molecules,
features[feature_type],
cluster_centers_sel[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of acceptors in all ligands: 25
代码
文本
[52]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[53]
view._display_image()
代码
文本

Hydrophobic contacts

代码
文本
[54]
feature_type = "hydrophobics"
view = visualize_clusters(
molecules,
features[feature_type],
cluster_centers_sel[feature_type],
feature_type,
feature_colors[feature_type],
)
view
Number of hydrophobics in all ligands: 5
代码
文本
[55]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[56]
view._display_image()
代码
文本

Show ensemble pharmacophore

In this last step, we combine the clustered pharmacophoric features (i.e. hydrogen bond donors and acceptors as well as hydrophobic contacts), to one ensemble pharmacophore, representing the pharmacophoric properties of the four selected ligands.

代码
文本
[57]
view = show_ligands(molecules)
# Load clusters
for feature_type in cluster_indices_sel.keys():
centers = cluster_centers_sel[feature_type]
for i, loc in enumerate(centers):
sphere_radius = 1
feature_color = feature_colors[feature_type]
label = f"{feature_type}_c{i}"
view.shape.add_sphere(loc, feature_color, sphere_radius, label)
view
代码
文本
[58]
view.render_image(trim=True, factor=2, transparent=True);
代码
文本
[59]
view._display_image()
代码
文本

Discussion

In this talktorial, we used a set of pre-aligned ligands, which are known to bind EGFR, to generate an ensemble pharmacophore model. This model could now be used for virtual screening against a large library of small molecules, in order to find novel small molecules that show the observed steric and physico-chemical properties and might therefore also bind to the EGFR binding site.

Before screening, the pharmacophore models are usually further optimized, e.g. features might be omitted in order to reduce the number of features for screening based on biological knowledge (some interaction might be reported as important whereas others are not) or based on chemical expertise.

We do not cover the virtual screening in this talktorial, however refer to an excellent tutorial by Nik Stiefl, demonstrating pharmacophore modeling and virtual screening with RDKit (RDKit UGM 2016 on GitHub).

We used k-means clustering to cluster pharmacophore feature. This clustering approach has the disadvantage that the user needs to define the number of clusters beforehand, which is usually based on a visual inspection of the point distribution before clustering (or during cluster refinement) and is therefore hindering for an automated pharmacophore generation. Density-based clustering methods, also in combination with k-means clustering, can be a solution for this. For further reading we recommend the introduction to Density-Based Clustering at Domino Data Lab.

Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Ligand-based pharmacophores

Authors:

  • Pratik Dhakal, CADD seminar, 2017, Charité/FU Berlin
  • Florian Gusewski, CADD seminar, 2018, Charité/FU Berlin
  • Jaime Rodríguez-Guerra, Volkamer lab, Charité
  • Dominique Sydow, Volkamer lab, Charité
代码
文本
化学信息学
TeachOpenCADD
化学信息学TeachOpenCADD
点个赞吧
本文被以下合集收录
化学
bohr7916f3
更新于 2024-08-22
1 篇0 人关注
AI4S
凝聚态平方
更新于 2024-03-18
32 篇0 人关注
推荐阅读
公开
TeachOpenCADD | 002 化合物的ADME
化学信息学TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15
1 转存文件
公开
TeachOpenCADD | 005 化合物的聚类
化学信息学TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15