空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的论文库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

TeachOpenCADD | 009 基于配体的药效团

化学信息学

TeachOpenCADD

化学信息学TeachOpenCADD

YangHe

发布于 2023-06-15

数据集

AI4SCUP-CNS-BBB(v1)

009 基于配体的药效团模型

🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先，请点击位于界面顶部的开始连接按钮，然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置，稍等片刻即可开始运行。

📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd，由杨合 📨 修改搬运至 Bohrium Notebook。

代码

文本

Talktorial T009: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising of talktorials T001-T010.

代码

文本

Note: Please run this notebook cell by cell. Running all cells in one is possible also, however, part of the nglview 3D representations might be missing.

代码

文本

Aim of this talktorial

In this talktorial, we use known EGFR ligands, which were selected and aligned in the previous talktorial, to identify donor, acceptor, and hydrophobic pharmacophoric features for each ligand. Those features are then clustered to define an ensemble pharmacophore, which represents the properties of the set of known EGFR ligands and can be used to search for novel EGFR ligands via virtual screening.

代码

文本

Learning goals

代码

文本

Contents in Theory

Pharmacophore modeling
- Structure- and ligand-based pharmacophore modeling
Virtual screening with pharmacophores
Clustering: k-means

代码

文本

Contents in Practical

Get pre-aligned ligands from previous talktorial
Show ligands with NGLView
Extract pharmacophore features
Show the pharmacophore features of all ligands
- Hydrogen bond donors
- Hydrogen bond acceptors
- Hydrophobic contacts
Collect coordinates of features per feature type
Generate ensemble pharmacophores
- Set static parameters for k-means clustering
- Set static parameters for cluster selection
- Define k-means clustering and cluster selection functions
- Cluster features
- Select relevant clusters
- Get selected cluster coordinates
Show clusters
- Hydrogen bond donors
- Hydrogen bond acceptors
- Hydrophobic contacts
Show ensemble pharmacophore

代码

文本

References

IUPAC pharmacophore definition (Pure & Appl. Chem (1998), 70, 1129-43)
3D pharmacophores in LigandScout (J. Chem. Inf. Model. (2005), 45, 160-9)
Book chapter: Pharmacophore Perception and Applications (Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82)
Book chapter: Structure-Based Virtual Screening (Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).
Monty Kier and the origin of the pharmacophore concept (Internet Electron. J. Mol. Des. (2007), 6, 271-9)
Nik Stiefl's demonstration of pharmacophore modeling with RDKit (RDKit UGM 2016 on GitHub)

代码

文本

Theory

代码

文本

Pharmacophores

In computer-aided drug design, the description of drug-target interactions with pharmacophores is a well-established method. The term pharmacophore was defined in 1998 by a IUPAC working party: "A pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response." (Pure & Appl. Chem. (1998), 70, 1129-43)

In other words, a pharmacophore consists of several pharmacophoric features, which describe important steric and physico-chemical properties of a ligand observed to bind a target under investigation. Such physico-chemical properties (also called feature types) can be hydrogen bond donors/acceptors, hydrophobic/aromatic interactions, or positively/negatively charged groups, and the steric properties are defined by the 3D arrangement of these features.

代码

文本

Structure- and ligand-based pharmacophore modeling

In pharmacophore modeling, two main approaches are used, depending on the biological question and available data sources, i.e. structure- and ligand-based pharmacophore modeling.

Structure-based pharmacophore models are derived from protein-ligand complexes. Features are defined by observed interactions between the protein and ligand, ensuring that only those ligand moieties are used for virtual screening that have already been shown to be involved in ligand binding. However, structures of protein-ligand complexes are not available for all targets. In this case, either complex structures can be generated by modeling the ligand into the target binding site, e.g. via molecular docking, or pharmacophore modeling methods can be invoked that only use the target binding site to detect potential protein-ligand interaction sites.

Ligand-based pharmacophore models are based on a set of ligands known to bind the target under investigation. The common chemical features of these ligands build the pharmacophore model. This method is used for targets with multiple known ligands and in case of missing protein-ligand complex structures. In this talktorial, we will use ligand-based pharmacophore modeling using a set of known EGFR ligands.

For more information on pharmacophore modeling, we recommend Pharmacophore Perception and Applications: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82 and J. Chem. Inf. Model. (2005), 45, 160-9.

Figure 1: Structure-based pharmacophore representing protein-ligand interactions. Figure created by Dominique Sydow.

代码

文本

Virtual screening with pharmacophores

As described earlier in Talktorial 4, virtual screening (VS) describes the screening of a query, e.g. a pharmacophore model (T009) or a query compound (T004), against a large library of compounds, in order to identify the small molecules in the library that are most likely to bind a target under investigation (represented by the query). In pharmacophore-based virtual screening, the compound library is matched compound-by-compound into a pharmacophore model and ranked by the best matching results (Structure-Based Virtual Screening: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).

代码

文本

Clustering: k-means

In this talktorial we will generate an ensemble pharmacophore by clustering the feature points of several ligand-based pharmacophores. The clustering algorithm uses the k-means clustering, which aims to cluster a data set into k clusters:

k different centroids are selected and each point of the data set is assigned to its closest centroids.
New centroids are calculated based on the current clusters and each point of the data set is newly assigned to its closest centroids.
This procedure is repeated until the centroids are stable.

More info about k-means in Wikipedia.

代码

文本

Practical

代码

文本

[1]

import collections

from collections import Counter

from pathlib import Path

import operator

import time

import math

import pandas as pd

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn import datasets, cluster

from rdkit import RDConfig, Chem, Geometry, DistanceGeometry

from rdkit.Chem import (

ChemicalFeatures,

rdDistGeom,

Draw,

rdMolTransforms,

AllChem,

)

from rdkit.Chem.Draw import DrawingOptions

from rdkit.Chem.Pharm3D import Pharmacophore, EmbedLib

from rdkit.Numerics import rdAlignment

import nglview as nv

代码

文本

Set the Path to the current notebook.

代码

文本

[2]

HERE = Path(_dh[-1])

DATA = HERE / "data"

代码

文本

Get pre-aligned ligands from previous talktorial

We retrieve all ligands that were aligned in the previous talktorial.

First, we get the file paths to all ligand PDB files.

代码

文本

[3]

mol_files = sorted(Path(HERE / "../T008_query_pdb/data/").glob("*_lig.pdb"))

print([f.name for f in mol_files])

['5HG8_lig.pdb', '5UG8_lig.pdb', '5UG9_lig.pdb', '5UGC_lig.pdb']

代码

文本

[4]

# NBVAL_CHECK_OUTPUT

pdb_ids = [Path(file).stem.split("_")[0] for file in mol_files]

pdb_ids

['5HG8', '5UG8', '5UG9', '5UGC']

代码

文本

Second, we read all ligands from these PDB files using RDKit.

代码

文本

[5]

# NBVAL_CHECK_OUTPUT

molecules = []

for mol_file in mol_files:

molecule = Chem.MolFromPDBFile(str(mol_file), removeHs=False)

if molecule is None:

print(mol_file, "Could not be read")

else:

Chem.SanitizeMol(molecule)

print(Chem.MolToSmiles(molecule))

molecules.append(molecule)

print(f"Number of molecules: {len(molecules)}")

CCC(O)NC1CCCC(OC2NC(NC3CNN(C)C3)NC3NCCC32)C1
CCC(O)NC1CN(C2NC(NC3CNN(C)C3)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C)C3N2)CC1F
Number of molecules: 4

代码

文本

[6]

Draw.MolsToGridImage(

molecules,

molsPerRow=4,

legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],

)

代码

文本

Upon inspection of the structures we notice, that we encountered a problem here: When loading ligands from a PDB file, RDKit does not assign bonds in structures like aromatic rings correctly. We use the RDKit function AssignBondOrdersFromTemplate, which assigns bonds to a molecule based on a reference molecule, e.g. in our case based on the SMILES pattern of the molecule.

Check for further information: RDKit discussion on Aromaticity of non-protein molecules in PDB not detected and RDKit documentation on AssignBondOrdersFromTemplate.

代码

文本

[7]

# Load SMILES for PDB ligand structures

ligands = pd.read_csv(HERE / "../T008_query_pdb/data/PDB_top_ligands.csv")

# Get SMILES in the same order as in pdb_ids

ligand_smiles = [

ligands[ligands["@structureId"] == pdb_id]["smiles"].values[0] for pdb_id in pdb_ids

]

# Generate RDKit Mol object from SMILES

reference_molecules = [Chem.MolFromSmiles(smiles) for smiles in ligand_smiles]

# Assign bond orders to molecules based on SMILES patterns of reference molecules

molecules = [

AllChem.AssignBondOrdersFromTemplate(reference_molecule, molecule)

for reference_molecule, molecule in zip(reference_molecules, molecules)

]

[00:05:46] WARNING: More than one matching pattern found - picking one

[00:05:46] WARNING: More than one matching pattern found - picking one

代码

文本

[8]

Draw.MolsToGridImage(

molecules,

molsPerRow=4,

legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],

)

代码

文本

Now the correct bonds are included. We can also have a look at the molecules in 2D. In order to keep the original coordinates, we make a copy for the 2D representation.

代码

文本

[9]

molecules_2d = []

for molecule in molecules:

tmp = Chem.Mol(molecule)

AllChem.Compute2DCoords(tmp)

molecules_2d.append(tmp)

Draw.MolsToGridImage(

molecules_2d,

molsPerRow=4,

legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],

)

代码

文本

Visualize with `nglview`

代码

文本

In the next step we visualize the ligand molecules with nglview using the following function. The molecules will be drawn on top of each other in a single view. This way we are able to visually compare strucural differences between the ligands.

代码

文本

[10]

def show_ligands(molecules):

"""Generate a view of the ligand molecules.

Parameters

-----------

molecules: list of rdkit.Chem.rdchem.Mol

Returns

----------

nglview.widget.NGLWidget

"""

view = nv.NGLWidget()

for molecule in molecules:

component = view.add_component(molecule)

time.sleep(0.1)

component.clear()

component.add_ball_and_stick(multipleBond=True)

return view

代码

文本

[11]

view = show_ligands(molecules)

view

代码

文本

[12]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[13]

view._display_image()

代码

文本

Extract pharmacophore features

As described above, the aim of this talktorial is to generate a ligand-based ensemble pharmacophore from a set of ligands. First, we need to extract pharmacophore features per ligand. Therefore, we load a feature factory. We use the default feature definitions.

代码

文本

[14]

feature_factory = AllChem.BuildFeatureFactory(str(Path(RDConfig.RDDataDir) / "BaseFeatures.fdef"))

代码

文本

We take a look at the pharmacophore features that are implemented in RDKit:

代码

文本

[15]

list(feature_factory.GetFeatureDefs().keys())

['Donor.SingleAtomDonor',
 'Acceptor.SingleAtomAcceptor',
 'NegIonizable.AcidicGroup',
 'PosIonizable.BasicGroup',
 'PosIonizable.PosN',
 'PosIonizable.Imidazole',
 'PosIonizable.Guanidine',
 'ZnBinder.ZnBinder1',
 'ZnBinder.ZnBinder2',
 'ZnBinder.ZnBinder3',
 'ZnBinder.ZnBinder4',
 'ZnBinder.ZnBinder5',
 'ZnBinder.ZnBinder6',
 'Aromatic.Arom4',
 'Aromatic.Arom5',
 'Aromatic.Arom6',
 'Aromatic.Arom7',
 'Aromatic.Arom8',
 'Hydrophobe.ThreeWayAttach',
 'Hydrophobe.ChainTwoWayAttach',
 'LumpedHydrophobe.Nitro2',
 'LumpedHydrophobe.RH6_6',
 'LumpedHydrophobe.RH5_5',
 'LumpedHydrophobe.RH4_4',
 'LumpedHydrophobe.RH3_3',
 'LumpedHydrophobe.tButyl',
 'LumpedHydrophobe.iPropyl']

代码

文本

As an example, we extract all features from the first molecule.

代码

文本

[16]

example_molecule = molecules[0]

example_molecule

代码

文本

[17]

# NBVAL_CHECK_OUTPUT

features = feature_factory.GetFeaturesForMol(example_molecule)

print(f"Number of features found: {len(features)}")

Number of features found: 16

代码

文本

The type of a feature is called family in RDKit and can be retrieved with GetFamily().

代码

文本

[18]

# NBVAL_CHECK_OUTPUT

features[0].GetFamily()

'Donor'

代码

文本

First we get the frequency of feature types for our example molecule.

代码

文本

[19]

# NBVAL_CHECK_OUTPUT

feature_frequency = collections.Counter(sorted([feature.GetFamily() for feature in features]))

feature_frequency

Counter({'Acceptor': 5,
         'Aromatic': 4,
         'Donor': 4,
         'Hydrophobe': 2,
         'LumpedHydrophobe': 1})

代码

文本

Now apply the functions shown above to all molecules in our ligand set. We calculate the feature type frequency per molecule, then we display it in a DataFrame.

代码

文本

[20]

molecule_feature_frequencies = []

for molecule in molecules:

features = [feature.GetFamily() for feature in feature_factory.GetFeaturesForMol(molecule)]

feature_frequency = collections.Counter(features)

molecule_feature_frequencies.append(feature_frequency)

feature_frequencies_df = (

pd.DataFrame(

molecule_feature_frequencies,

index=[f"Mol{i}" for i, _ in enumerate(molecules, 1)],

)

.fillna(0)

.astype(int)

)

feature_frequencies_df.transpose()

	Mol1	Mol2	Mol3	Mol4
Donor	4	2	2	2
Acceptor	5	6	7	7
Aromatic	4	3	3	3
Hydrophobe	2	1	1	1
LumpedHydrophobe	1	1	1	0
PosIonizable	0	1	1	1

代码

文本

Furtheron, we concentrate in this talktorial only on the following feature types: hydrogen bond acceptors (acceptors), hydrogen bond donors (donors), and hydrophobic contacts (hydrophobics).

We retrieve the feature RDKit objects per feature type and per molecule.

代码

文本

[21]

acceptors = []

donors = []

hydrophobics = []

for molecule in molecules:

acceptors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Acceptor"))

donors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Donor"))

hydrophobics.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Hydrophobe"))

features = {

"donors": donors,

"acceptors": acceptors,

"hydrophobics": hydrophobics,

}

代码

文本

Show the pharmacophore features of all ligands

Pharmacophore feature types usually are displayed in defined colors, e.g. usually hydrogen bond donors, hydrogen bond acceptors, and hydrophobic contacts are colored green, red, and yellow, respectively.

代码

文本

[22]

feature_colors = {

"donors": (0, 0.9, 0), # Green

"acceptors": (0.9, 0, 0), # Red

"hydrophobics": (1, 0.9, 0), # Yellow

}

代码

文本

[23]

def visualize_features(

molecules,

features,

feature_type="features",

color="yellow",

sphere_radius=0.5,

"""Generate a view of the molecules highlighting the specified feature type.

Parameters

-----------

molecules: list of rdkit.Chem.rdchem.Mol

molecules to be visualized

features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature

extracted features from molecules chosen to be highlighted

feature_type: string, optional

name of the feature to be highlighted

color: string, optional

color used to display the highlighted features

sphere_radius: float, optional

display size of the highlighted features

Returns

----------

nglview.widget.NGLWidget

"""

print(f"Number of {feature_type} in all ligands: {sum([len(i) for i in features])}")

view = show_ligands(molecules)

for i, feature_set in enumerate(features, 1):

for feature in feature_set:

loc = list(feature.GetPos())

label = f"{feature_type}_{i}"

view.shape.add_sphere(loc, color, sphere_radius, label)

return view

代码

文本

We use this function to visualize the features for the feature types under consideration.

代码

文本

Hydrogen bond donors

代码

文本

[24]

feature_type = "donors"

view = visualize_features(

molecules,

features[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of donors in all ligands: 10

代码

文本

[25]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[26]

view._display_image()

代码

文本

Hydrogen bond acceptors

代码

文本

[27]

feature_type = "acceptors"

view = visualize_features(

molecules,

features[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of acceptors in all ligands: 25

代码

文本

[28]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[29]

view._display_image()

代码

文本

Hydrophobic contacts

代码

文本

[30]

feature_type = "hydrophobics"

view = visualize_features(

molecules,

features[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of hydrophobics in all ligands: 5

代码

文本

[31]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[32]

view._display_image()

代码

文本

Collect coordinates of features per feature type

We want to cluster features per feature type. To do this we collect all coordinates of features and store them per feature type.

代码

文本

[33]

features_coord = {

"donors": [list(item.GetPos()) for sublist in features["donors"] for item in sublist],

"acceptors": [list(item.GetPos()) for sublist in features["acceptors"] for item in sublist],

"hydrophobics": [

list(item.GetPos()) for sublist in features["hydrophobics"] for item in sublist

}

代码

文本

Now, we have the positions of e.g. all acceptor features:

代码

文本

[34]

features_coord["acceptors"]

[[-13.49, 14.732, -27.925],
 [-11.99, 13.164, -28.961],
 [-13.173, 17.112, -28.025],
 [-16.131, 17.359, -22.003],
 [-16.873, 11.631, -25.856],
 [-13.808, 14.802, -27.137],
 [-12.146, 16.255, -28.067],
 [-11.385, 12.957, -29.278],
 [-16.736, 10.854, -25.879],
 [-15.435, 19.394, -27.21],
 [-16.039, 17.251, -22.54],
 [-13.863, 14.66, -27.134],
 [-12.175, 16.118, -27.994],
 [-11.411, 12.856, -29.266],
 [-17.026, 10.809, -26.396],
 [-15.601, 9.901, -28.086],
 [-15.445, 19.229, -27.215],
 [-16.044, 17.165, -22.506],
 [-13.831, 14.581, -27.162],
 [-12.161, 16.037, -28.079],
 [-11.381, 12.733, -29.268],
 [-17.02, 10.739, -26.48],
 [-15.542, 9.814, -28.111],
 [-15.515, 19.1, -27.334],
 [-16.09, 17.203, -22.553]]

代码

文本

Generate ensemble pharmacophores

In order to generate ensemble pharmacophores, we use k-means clustering to cluster features per feature type.

代码

文本

Set static parameters for k-means clustering

kq: With this paramter, we determine the number of clusters k per feature type depending on the number of feature points, i.e. per feature type
k = number_of_features / kq
kq should be selected so that k (feature clusters) is for all clusters at least 1 and not larger than 4-5 clusters

We set our kq used to determine k in k-means as follows:

代码

文本

[35]

kq = 7

代码

文本

Set static parameters for cluster selection

min_cluster_size: We only want to retain clusters that potentially contain features from most molecules in our ligand ensemble. Therefore, we set this variable to 75% of the number of molecules in our ligand ensemble.
top_cluster_number: With this parameter, we select only the largest clusters.

代码

文本

[36]

min_cluster_size = int(len(molecules) * 0.75)

top_cluster_number = 4

代码

文本

Define k-means clustering and cluster selection functions

We define a function that calculates the centers of clusters, which are derived from k-means clustering.

代码

文本

[37]

def clustering(feature_coord, kq):

"""

Compute the k-means clustering of input feature coordinates.

Parameters

-----------

feature_coord: list

feature coordinates extracted from the features of a set of molecules

kq: int

Quotient used when calculating number of features k

Returns

----------

sklearn.cluster._kmeans.KMeans

"""

# Define parameter k as feature number divided by "k quotient"

k = math.ceil(len(feature_coord) / kq)

# Tailor-made adaption of k for hydrophobics in for the example in this talktorial

k = 2 if k == 1 else k

print(f"Clustering: \nVariable k in k-means: {k} of {len(feature_coord)} points\n")

# Initialize k-means and compute clustering

k_means = cluster.KMeans(n_clusters=k, n_init=10)

k_means.fit(feature_coord)

return k_means

代码

文本

We define a function that sorts the clusters by size and outputs a list of indices of the largest clusters.

代码

文本

[38]

def get_clusters(k_means, min_cluster_size, top_cluster_number):

"""

Select clusters from a k-means clustering and returns the indices of the selected clusters.

The selection is based on cluster size. The largest clusters are chosen.

Parameters

-----------

k_means: sklearn.cluster._kmeans.KMeans

k-means clustering of feature coordinates

min_cluster_size: int

minimum size needed for a cluster to be viable for selection

top_cluster_number: int

number of clusters to be selected

Returns

----------

list of int

indices of selected clusters

"""

# Sort clusters by size and only show largest

feature_labels = k_means.labels_

feature_labels_count = Counter(feature_labels)

feature_labels_count = sorted(

feature_labels_count.items(), key=operator.itemgetter(1), reverse=True

)

print(f"Clusters sorted by size: \n{feature_labels_count}\n")

# Get number of the largest clusters, which are larger then the threshold (selected clusters)

cluster_indices_sel = []

for cluster_index, cluster_size in feature_labels_count:

if cluster_size >= min_cluster_size and top_cluster_number > 0:

cluster_indices_sel.append(cluster_index)

top_cluster_number -= 1

print(f"Cluster indices of selected clusters: \n{cluster_indices_sel}\n")

return cluster_indices_sel

代码

文本

Cluster features

For each feature type, we perform the k-means clustering with our defined clustering function.

代码

文本

[39]

k_means = {

"donors": clustering(features_coord["donors"], kq),

"acceptors": clustering(features_coord["acceptors"], kq),

"hydrophobics": clustering(features_coord["hydrophobics"], kq),

}

Clustering: 
Variable k in k-means: 2 of 10 points

Clustering: 
Variable k in k-means: 4 of 25 points

Clustering: 
Variable k in k-means: 2 of 5 points

代码

文本

Select relevant clusters

For each feature type, we select relevant clusters with our defined get_clusters function.

代码

文本

[40]

print("Hydrogen bond donors\n")

cluster_indices_sel_donors = get_clusters(k_means["donors"], min_cluster_size, top_cluster_number)

Hydrogen bond donors

Clusters sorted by size: 
[(1, 6), (0, 4)]

Cluster indices of selected clusters: 
[1, 0]

代码

文本

[41]

print("Hydrogen bond acceptors\n")

cluster_indices_sel_acceptors = get_clusters(

k_means["acceptors"], min_cluster_size, top_cluster_number

)

Hydrogen bond acceptors

Clusters sorted by size: 
[(0, 12), (2, 6), (1, 4), (3, 3)]

Cluster indices of selected clusters: 
[0, 2, 1, 3]

代码

文本

[42]

print("Hydrophobic contacts\n")

cluster_indices_sel_hydrophobic = get_clusters(

k_means["hydrophobics"], min_cluster_size, top_cluster_number

)

Hydrophobic contacts

Clusters sorted by size: 
[(0, 4), (1, 1)]

Cluster indices of selected clusters: 
[0]

代码

文本

[43]

cluster_indices_sel = {

"donors": cluster_indices_sel_donors,

"acceptors": cluster_indices_sel_acceptors,

"hydrophobics": cluster_indices_sel_hydrophobic,

}

代码

文本

Get selected cluster coordinates

代码

文本

[44]

def get_selected_cluster_center_coords(k_means, cluster_indices_sel):

"""Retrieve cluster center coordinates for selected clusters.

The selected clusters are defined by their indices.

Parameters

-----------

k_means: sklearn.cluster._kmeans.KMeans

k-means clustering of feature coordinates

cluster_indices_sel: list of int

indices of the chosen clusters

Returns

----------

two-dimensional list of floats

cluster center coordinates

"""

cluster_centers = k_means.cluster_centers_

# Cast to list and then to pandas Series (for element selection by indices)

cluster_centers = pd.Series(cluster_centers.tolist())

# Select cluster centers by indices of selected clusters

cluster_centers_sel = cluster_centers[cluster_indices_sel]

return list(cluster_centers_sel)

代码

文本

[45]

cluster_centers_sel = {

"donors": get_selected_cluster_center_coords(k_means["donors"], cluster_indices_sel["donors"]),

"acceptors": get_selected_cluster_center_coords(

k_means["acceptors"], cluster_indices_sel["acceptors"]

"hydrophobics": get_selected_cluster_center_coords(

k_means["hydrophobics"], cluster_indices_sel["hydrophobics"]

}

代码

文本

[46]

cluster_centers_sel["acceptors"]

[[-12.567833333333333, 14.66725, -28.191333333333333],
 [-16.466333333333335, 10.624666666666666, -26.801333333333332],
 [-16.076, 17.244500000000002, -22.4005],
 [-15.465, 19.241, -27.253]]

代码

文本

Show clusters

Per feature type, we visualize cluster centers alongside with all molecules and all feature points.

代码

文本

[47]

def visualize_clusters(

molecules,

features,

clusters,

feature_type="features",

color="yellow",

feature_radius=0.5,

cluster_radius=1,

"""Visualize the cluster centers.

Parameters

-----------

molecules: list of rdkit.Chem.rdchem.Mol

molecules to be visualized

features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature

extracted features from molecules chosen to be highlighted

clusters: Two-dimensional list of floats

cluster center coordinates

feature_type: string, optional

feature type being visualized

color: string, optional

color used for rendering the features and cluster centers

feature_radius: float, optional

rendering size of the features

cluster_radius: float, optional

rendering size of the cluster centers

Returns

----------

nglview.widget.NGLWidget

"""

view = visualize_features(

molecules,

features,

feature_type,

color=color,

sphere_radius=feature_radius,

)

for i, center in enumerate(clusters, 1):

view.shape.add_sphere(list(center), color, cluster_radius, f"cluster_{feature_type}_{i}")

return view

代码

文本

Hydrogen bond donors

代码

文本

[48]

feature_type = "donors"

view = visualize_clusters(

molecules,

features[feature_type],

cluster_centers_sel[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of donors in all ligands: 10

代码

文本

[49]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[50]

view._display_image()

代码

文本

Hydrogen bond acceptor

代码

文本

[51]

feature_type = "acceptors"

view = visualize_clusters(

molecules,

features[feature_type],

cluster_centers_sel[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of acceptors in all ligands: 25

代码

文本

[52]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[53]

view._display_image()

代码

文本

Hydrophobic contacts

代码

文本

[54]

feature_type = "hydrophobics"

view = visualize_clusters(

molecules,

features[feature_type],

cluster_centers_sel[feature_type],

feature_type,

feature_colors[feature_type],

)

view

Number of hydrophobics in all ligands: 5

代码

文本

[55]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[56]

view._display_image()

代码

文本

Show ensemble pharmacophore

In this last step, we combine the clustered pharmacophoric features (i.e. hydrogen bond donors and acceptors as well as hydrophobic contacts), to one ensemble pharmacophore, representing the pharmacophoric properties of the four selected ligands.

代码

文本

[57]

view = show_ligands(molecules)

# Load clusters

for feature_type in cluster_indices_sel.keys():

centers = cluster_centers_sel[feature_type]

for i, loc in enumerate(centers):

sphere_radius = 1

feature_color = feature_colors[feature_type]

label = f"{feature_type}_c{i}"

view.shape.add_sphere(loc, feature_color, sphere_radius, label)

view

代码

文本

[58]

view.render_image(trim=True, factor=2, transparent=True);

代码

文本

[59]

view._display_image()

代码

文本

Discussion

In this talktorial, we used a set of pre-aligned ligands, which are known to bind EGFR, to generate an ensemble pharmacophore model. This model could now be used for virtual screening against a large library of small molecules, in order to find novel small molecules that show the observed steric and physico-chemical properties and might therefore also bind to the EGFR binding site.

Before screening, the pharmacophore models are usually further optimized, e.g. features might be omitted in order to reduce the number of features for screening based on biological knowledge (some interaction might be reported as important whereas others are not) or based on chemical expertise.

We do not cover the virtual screening in this talktorial, however refer to an excellent tutorial by Nik Stiefl, demonstrating pharmacophore modeling and virtual screening with RDKit (RDKit UGM 2016 on GitHub).

We used k-means clustering to cluster pharmacophore feature. This clustering approach has the disadvantage that the user needs to define the number of clusters beforehand, which is usually based on a visual inspection of the point distribution before clustering (or during cluster refinement) and is therefore hindering for an automated pharmacophore generation. Density-based clustering methods, also in combination with k-means clustering, can be a solution for this. For further reading we recommend the introduction to Density-Based Clustering at Domino Data Lab.

Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Ligand-based pharmacophores

Authors:

Pratik Dhakal, CADD seminar, 2017, Charité/FU Berlin
Florian Gusewski, CADD seminar, 2018, Charité/FU Berlin
Jaime Rodríguez-Guerra, Volkamer lab, Charité
Dominique Sydow, Volkamer lab, Charité

代码

文本

化学信息学

TeachOpenCADD

化学信息学TeachOpenCADD

点个赞吧

本文被以下合集收录

化学

bohr7916f3

更新于 2024-08-22

1 篇0 人关注

AI4S

凝聚态平方

更新于 2024-03-18

32 篇0 人关注

009 基于配体的药效团模型

Aim of this talktorial

Learning goals

Contents in Theory

Contents in Practical

References

Theory

Pharmacophores

Structure- and ligand-based pharmacophore modeling

Virtual screening with pharmacophores

Clustering: k-means

Practical

Get pre-aligned ligands from previous talktorial

Visualize with nglview

Extract pharmacophore features

Show the pharmacophore features of all ligands

Hydrogen bond donors

Hydrogen bond acceptors

Hydrophobic contacts

Collect coordinates of features per feature type

Generate ensemble pharmacophores

Set static parameters for k-means clustering

Set static parameters for cluster selection

Define k-means clustering and cluster selection functions

Cluster features

Select relevant clusters

Get selected cluster coordinates

Show clusters

Hydrogen bond donors

Hydrogen bond acceptor

Hydrophobic contacts

Show ensemble pharmacophore

Discussion

Reference

Reprint statement

Visualize with `nglview`