Uni-Mol Docking Demonstration (Using PoseBusters as an Example)


Yani Guan

Updated 2024-10-24

Recommended image: unimol-docking:pytorch1.12.1-cuda11.6

Recommended machine: c3_m4_1 * NVIDIA T4

Background

About Uni-Mol

About PoseBusters

Preparation Before Running:

Environment

Code, Data, and Model

Running

Import Modules

Data Preprocessing Function for Generating lmdb Files

Generating lmdb Files for Model Input from Protein pdb Files and Ligand sdf Files

Inference Using Public Model Weights

Perform Docking Based on the Predicted Distance Matrix, Then Calculate the RMSD Metric:

Calculate Symmetric RMSD Metric

Prediction Structure Visualization

©️ All rights reserved 2023 @ Author
Author: Gengmo Zhou 📨
Date: 2024-7-24
Licenses: This Bohrium notebook uses Uni-Mol model parameters, and its output content follows the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find detailed information at: http://creativecommons.org/licenses/by-nc-sa/4.0
Quick Start: Click the Start Connection button above, then select the unimol-docking:pytorch1.12.1-cuda11.6 image and a GPU machine. The cheapest GPU option is sufficient.


Background


About Uni-Mol

Code: https://github.com/deepmodeling/Uni-Mol
Paper: https://openreview.net/forum?id=6K2RM6wVqKu

Uni-Mol is a universal 3D molecular representation learning framework based on molecular structures, released by DeepModeling in May 2022. It includes two pre-trained models that share the same SE(3) Transformer architecture: a molecular model pre-trained on 209M molecular conformations, and a pocket model pre-trained on 3 million candidate protein pockets.

Utilizing 3D structural information combined with an effective pre-training scheme enables Uni-Mol to surpass previous best methods in 14 out of 15 molecular property prediction tasks. Notably, Uni-Mol excels in 3D space-related tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. The paper has been accepted by the top machine learning conference ICLR 2023.


About PoseBusters

Paper: https://arxiv.org/pdf/2308.05777.pdf

PoseBusters is a Python package that performs a series of standard quality checks using the well-known cheminformatics toolkit RDKit. Only methods that pass these checks and also predict binding modes close to the native ones should be considered to have "state-of-the-art" performance.

The PoseBusters benchmark set is a new, carefully curated, publicly available set of crystal complexes from the PDB. It is a diverse, recent collection of high-quality protein-ligand complexes containing drug-like molecules. It only includes complexes released since 2021, thus excluding any complexes from the PDBbind general set v2020, which was used to train Uni-Mol.


Preparation Before Running:

Environment

Base Docker image:

dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma

Other dependencies: RDKit and BioPandas:

rdkit==2022.9.3
biopandas==0.4.1

Data: Protein PDB files and ligand SDF files from PoseBusters and Astex

Code, Data, and Model

Code link: https://github.com/deepmodeling/Uni-Mol
Commit: b962451 (b962451a019e15363bd34b3af9d3a3cd02330947)
Project path: /workspace/Uni-Mol
Data path: /workspace/Uni-Mol/eval_sets
Model path: /workspace/Uni-Mol/ckp/binding_pose_220908.pt (can be downloaded from the GitHub repository)


Running

Import Modules


[1]

import os
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, RDLogger
from rdkit.Chem import AllChem
from tqdm import tqdm
RDLogger.DisableLog('rdApp.*')
import warnings
warnings.filterwarnings(action='ignore')
from multiprocessing import Pool
import copy
import lmdb
from biopandas.pdb import PandasPdb
from sklearn.cluster import KMeans
from rdkit.Chem.rdMolAlign import AlignMolConformers


Data Preprocessing Function for Generating `lmdb` Files

Ligand Preparation

Extract molecules from SDF files and use RDKit to generate 100 conformations for each. Then, cluster these conformations into 10 groups using k-means, and use them as the initial input for the model.
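The selection of one representative conformer per cluster can be sketched as follows. This is a minimal pure-Python illustration, not the notebook's implementation: it assumes the cluster centers are already known (in the notebook they come from scikit-learn's KMeans), and `pick_representatives` is a hypothetical helper name introduced only for this example. For each center, it returns the index of the conformer whose flattened coordinates lie closest to that center.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_representatives(conformers, centers):
    """Return, for each cluster center, the index of the nearest conformer.

    conformers: list of flattened coordinate vectors (num_atoms * 3 floats each)
    centers:    list of cluster centers with the same vector length
    """
    picked = []
    for center in centers:
        distances = [squared_distance(conf, center) for conf in conformers]
        picked.append(distances.index(min(distances)))
    return picked


# Toy data (assumed): four "conformers" of a one-atom molecule and two centers.
conformers = [
    [0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0],
    [5.0, 5.0, 5.0],
    [5.1, 5.0, 5.0],
]
centers = [[0.05, 0.0, 0.0], [5.2, 5.0, 5.0]]
representatives = pick_representatives(conformers, centers)
print(representatives)
```

Picking an actual conformer rather than the cluster center itself matters because a centroid is an average of coordinates and is generally not a chemically valid geometry.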

Protein Preparation

Protein pocket residues are defined as residues within 6 Å of any heavy atom of the ligand crystal structure. Atoms are then extracted from these residues, and metal and rare-element atoms are filtered out to obtain the pocket atoms for model input.
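The distance-based pocket definition can be illustrated with a small pure-Python sketch. The tuple layout of `protein_atoms` and the helper name `pocket_residues` are assumptions made for this example; the notebook itself works on BioPandas data frames. The later filtering of metal and rare-element atoms is not shown here.

```python
import math


def pocket_residues(protein_atoms, ligand_coords, cutoff=6.0):
    """Return the set of (chain_id, residue_number) within `cutoff` Å of the ligand.

    protein_atoms: list of (chain_id, residue_number, element, (x, y, z))
    ligand_coords: list of (x, y, z) for the ligand heavy atoms
    """
    selected = set()
    for chain_id, residue_number, element, xyz in protein_atoms:
        if element == "H":
            continue  # only heavy protein atoms decide residue membership
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            selected.add((chain_id, residue_number))
    return selected


# Toy data (assumed): one ligand heavy atom at the origin.
protein_atoms = [
    ("A", 10, "C", (1.0, 0.0, 0.0)),   # 1 Å away: selected
    ("A", 11, "N", (7.0, 0.0, 0.0)),   # 7 Å away: outside the cutoff
    ("B", 5, "H", (2.0, 0.0, 0.0)),    # close, but a hydrogen: ignored
    ("B", 6, "O", (0.0, 6.0, 0.0)),    # exactly 6 Å away: selected
]
ligand_coords = [(0.0, 0.0, 0.0)]
pocket = pocket_residues(protein_atoms, ligand_coords)
print(sorted(pocket))
```

Once a residue is selected, all of its atoms are taken into the pocket, so a single close contact is enough to include the whole residue.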


[2]


# allowed atom types
main_atoms = ['N', 'CA', 'C', 'O', 'H']
allow_pocket_atoms = ['C', 'H', 'N', 'O', 'S']

def cal_configs(coords):
    """Calculate pocket configs: bounding-box center and size along x, y, z"""
    centerx, centery, centerz = list((np.max(coords, axis=0) + np.min(coords, axis=0)) / 2)
    # box size is the full extent (max - min), consistent with the center above
    sizex, sizey, sizez = list(np.max(coords, axis=0) - np.min(coords, axis=0))
    config = {'cx': centerx, 'cy': centery, 'cz': centerz,
              'sx': sizex, 'sy': sizey, 'sz': sizez}
    return config, centerx, centery, centerz, sizex, sizey, sizez

def filter_pocketatoms(atom):
    """Return the atom name if it is an allowed pocket atom, otherwise None"""
    if atom[:2] in ['Cd', 'Cs', 'Cn', 'Ce', 'Cm', 'Cf', 'Cl', 'Ca',
                    'Cr', 'Co', 'Cu', 'Nh', 'Nd', 'Np', 'No', 'Ne', 'Na',
                    'Ni', 'Nb', 'Os', 'Og', 'Hf', 'Hg', 'Hs', 'Ho', 'He',
                    'Sr', 'Sn', 'Sb', 'Sg', 'Sm', 'Si', 'Sc', 'Se']:
        return None
    # strip a leading digit and check the rest of the name
    if atom[0] >= '0' and atom[0] <= '9':
        return filter_pocketatoms(atom[1:])
    if atom[0] in ['Z', 'M', 'P', 'D', 'F', 'K', 'I', 'B']:
        return None
    if atom[0] in allow_pocket_atoms:
        return atom

def single_conf_gen(tgt_mol, num_confs=1000, seed=42, removeHs=True):
    mol = copy.deepcopy(tgt_mol)
    mol = Chem.AddHs(mol)
    allconformers = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed, clearConfs=True)
    sz = len(allconformers)
    for i in range(sz):
        try:
            AllChem.MMFFOptimizeMolecule(mol, confId=i)
        except Exception:
            # keep the unoptimized conformer if MMFF fails
            continue
    if removeHs:
        mol = Chem.RemoveHs(mol)
    return mol

def clustering_coords(mol, M=1000, N=100, seed=42, removeHs=True, method='bonds'):
    rdkit_coords_list = []
    if method == 'rdkit_MMFF':
        rdkit_mol = single_conf_gen(mol, num_confs=M, seed=seed, removeHs=removeHs)
    else:
        # a bare `raise` has no active exception here, so raise an explicit error
        raise ValueError('no conformer generation methods:{}'.format(method))
    noHsIds = [atom.GetIdx() for atom in rdkit_mol.GetAtoms() if atom.GetAtomicNum() != 1]
    # exclude hydrogens for aligning
    AlignMolConformers(rdkit_mol, atomIds=noHsIds)
    sz = len(rdkit_mol.GetConformers())
    for i in range(sz):
        _coords = rdkit_mol.GetConformers()[i].GetPositions().astype(np.float32)
        rdkit_coords_list.append(_coords)

    # cluster confs, select the nearest conf to the center
    # (num_confs, num_atoms, 3)
    rdkit_coords = np.array(rdkit_coords_list)[:, noHsIds]
    # (num_confs, num_atoms, 3) -> (num_confs, num_atoms*3)
    rdkit_coords_flatten = rdkit_coords.reshape(sz, -1)
    kmeans = KMeans(n_clusters=N, random_state=seed).fit(rdkit_coords_flatten)
    # (num_clusters, num_atoms, 3)
    center_coords = kmeans.cluster_centers_.reshape(N, -1, 3)
    # (num_clusters, num_confs)
    cdist = ((center_coords[:, None] - rdkit_coords[None, :])**2).sum(axis=(-1, -2))
    # (num_clusters,)
    argmin = np.argmin(cdist, axis=-1)
    coords_list = [rdkit_coords_list[i] for i in argmin]
    return coords_list

def extract_pose_posebuster(content):
    pdbid, ligid, protein_path, ligand_path, index = content

    def read_pdb(path, pdbid):
        #### protein preparation
        pfile = os.path.join(path, pdbid+'.pdb')
        pmol = PandasPdb().read_pdb(pfile)
        return pmol

    ### totally posebuster data
    def read_mol(path, pdbid, ligid):
        lsdf = os.path.join(path, f'{pdbid}_{ligid}.sdf')
        supp = Chem.SDMolSupplier(lsdf)
        mols = [mol for mol in supp if mol]
        if len(mols) == 0:
            print(lsdf)
        mol = mols[0]
        return mol

    # influence pocket size
    dist_thres = 6
    if pdbid == 'index' or pdbid == 'readme':
        return None

    pmol = read_pdb(protein_path, pdbid)
    pname = pdbid
    mol = read_mol(ligand_path, pdbid, ligid)
    mol = Chem.RemoveHs(mol)
    lcoords = mol.GetConformer().GetPositions().astype(np.float32)

    # collect residues with a heavy atom within dist_thres of any ligand heavy atom
    pdf = pmol.df['ATOM']
    filter_std = []
    for lcoord in lcoords:
        pdf['dist'] = pmol.distance(xyz=list(lcoord), records=('ATOM',))
        df = pdf[(pdf.dist <= dist_thres) & (pdf.element_symbol != 'H')][['chain_id', 'residue_number']]
        filter_std += list(zip(df.chain_id.tolist(), df.residue_number.tolist()))
    filter_std = set(filter_std)

    # gather all atoms of the selected residues
    patoms, pcoords, residues = [], np.empty((0, 3)), []
    for id, res in filter_std:
        df = pdf[(pdf.chain_id == id) & (pdf.residue_number == res)]
        patoms += df['atom_name'].tolist()
        pcoords = np.concatenate((pcoords, df[['x_coord', 'y_coord', 'z_coord']].to_numpy()), axis=0)
        residues += [str(id)+str(res)]*len(df)
    if len(pcoords) == 0:
        print('empty pocket:', pdbid)
        return None
    config, centerx, centery, centerz, sizex, sizey, sizez = cal_configs(pcoords)

    # filter unnormal atoms, include metal
    atoms, index, residues_tmp = [], [], []
    for i, a in enumerate(patoms):
        output = filter_pocketatoms(a)
        if output is not None:
            index.append(True)
            atoms.append(output)
            residues_tmp.append(residues[i])
        else:
            index.append(False)
    coordinates = pcoords[index].astype(np.float32)
    residues = residues_tmp
    assert len(atoms) == len(residues)
    assert len(atoms) == coordinates.shape[0]
    if len(atoms) != coordinates.shape[0]:
        print(pname)
        return None
    patoms = atoms
    pcoords = [coordinates]
    side = [0 if a in main_atoms else 1 for a in patoms]

    # ligand: holo pose plus N clustered RDKit conformers as initial inputs
    smiles = Chem.MolToSmiles(mol)
    mol = AllChem.AddHs(mol, addCoords=True)
    latoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    holo_coordinates = [mol.GetConformer().GetPositions().astype(np.float32)]
    holo_mol = mol
    M, N = 100, 10
    coordinate_list = clustering_coords(mol, M=M, N=N, seed=42, removeHs=False, method='rdkit_MMFF')
    mol_list = [mol]*N
    ligand = [latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol]
    return pname, patoms, pcoords, side, residues, config, ligand

def parser(content):
    result = extract_pose_posebuster(content)
    # propagate failures so write_lmdb can count them instead of crashing on unpacking
    if result is None:
        return None
    pname, patoms, pcoords, side, residues, config, ligand = result
    latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol = ligand
    return pickle.dumps(
        {
            "atoms": latoms,
            "coordinates": coordinate_list,
            "mol_list": mol_list,
            "pocket_atoms": patoms,
            "pocket_coordinates": pcoords,
            "side": side,
            "residue": residues,
            "config": config,
            "holo_coordinates": holo_coordinates,
            "holo_mol": holo_mol,
            "holo_pocket_coordinates": pcoords,
            "smi": smiles,
            'pocket': pname,
            'scaffold': pname,
        },
        protocol=-1,
    )

def write_lmdb(protein_path, ligand_path, outpath, meta_info_file, lmdb_name, num_ligand=428, nthreads=8):
    os.makedirs(outpath, exist_ok=True)
    df = pd.read_csv(meta_info_file)
    print(f'Example of meta_info content: \n{df.head(1)}')
    pdb_ids = list(df['pdb_code'].values)[:num_ligand]
    lig_ids = list(df['lig_code'].values)[:num_ligand]
    print(f'pdb code: {pdb_ids} \nlig code: {lig_ids}')
    content_list = list(zip(pdb_ids, lig_ids, [protein_path]*len(pdb_ids), [ligand_path]*len(pdb_ids), range(len(pdb_ids))))
    outputfilename = os.path.join(outpath, lmdb_name + '.lmdb')
    # remove any existing lmdb file before writing a new one
    try:
        os.remove(outputfilename)
    except OSError:
        pass
    env_new = lmdb.open(
        outputfilename,
        subdir=False,
        readonly=False,
        lock=False,
        readahead=False,
        meminit=False,
        max_readers=1,
        map_size=int(100e9),
    )
    txn_write = env_new.begin(write=True)
    print("Start preprocessing data...")
    print(f'Number of systems: {len(pdb_ids)}')
    with Pool(nthreads) as pool:
        i = 0
        failed_num = 0
        for inner_output in tqdm(pool.imap(parser, content_list)):
            if inner_output is not None:
                txn_write.put(f"{i}".encode("ascii"), inner_output)
                i += 1
            elif inner_output is None:
                failed_num += 1
        txn_write.commit()
        env_new.close()
    print(f'\nTotal num: {len(pdb_ids)}, Success: {i}, Failed: {failed_num}')
    print("Done!")
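Each record written by `write_lmdb` is a pickled dict stored under an ASCII-encoded integer key. The sketch below reproduces only that key/value layout. It uses a plain Python dict as a stand-in for the LMDB write transaction (an assumption, so the example runs without the `lmdb` package) and keeps just two of the record fields. The SMILES strings are the ones that appear in the docking log of this notebook.

```python
import pickle

# Stand-in for the LMDB write transaction: a dict from bytes keys to bytes values.
store = {}

# Reduced records (assumed subset of the real fields, for illustration only).
records = [
    {"pocket": "5S8I", "smi": "CNC(=O)c1scc2c1OCCO2"},
    {"pocket": "5SAK", "smi": r"N=C1N/C(=N\Nc2ccccc2)c2ccccc21"},
]

# Same convention as the notebook: key = ASCII index, value = pickle with protocol -1.
for i, record in enumerate(records):
    store[f"{i}".encode("ascii")] = pickle.dumps(record, protocol=-1)

# Reading a record back means looking up the key and unpickling the value.
first = pickle.loads(store[b"0"])
print(first["pocket"], first["smi"])
```

With a real LMDB file, the lookup would go through a read transaction instead of a dict, but the decoding step with `pickle.loads` is the same.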


Generating `lmdb` Files for Model Input from Protein `pdb` Files and Ligand `sdf` Files

Data Description: eval_sets

PoseBusters data (428 entries) and Astex data (85 entries) are stored in the posebusters and astex folders under eval_sets, respectively. The parse_protein.py script in the same directory is used to process the downloaded raw pdb and sdf files.
In the posebusters directory, the proteins and ligands folders store the processed pdb and sdf files produced by the parse_protein script. The pdb files are named {pdb_code}.pdb, and the sdf files are named {pdb_code}_{lig_code}.sdf. The posebuster_set_meta.csv file contains the pdb code, ligand code, and corresponding download URLs for each entry in the PoseBusters benchmark. The raw data is downloaded from PDB via these URLs.
The data directory structure for Astex is similar to that of PoseBusters.
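The naming convention can be made concrete with a short sketch. The helper name `complex_paths` is hypothetical, and the `proteins` and `ligands` folder names follow the paths used in the code cell of this notebook; the example entry is the first row of the meta file shown in the output.

```python
import os


def complex_paths(root, pdb_code, lig_code):
    """Build the protein and ligand file paths for one benchmark entry."""
    protein = os.path.join(root, "proteins", f"{pdb_code}.pdb")
    ligand = os.path.join(root, "ligands", f"{pdb_code}_{lig_code}.sdf")
    return protein, ligand


protein_file, ligand_file = complex_paths(
    "/workspace/Uni-Mol/eval_sets/posebusters", "5S8I", "2LY"
)
print(protein_file)
print(ligand_file)
```

Checking these paths with `os.path.exists` before preprocessing is a cheap way to catch a missing download early.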

Here, for demonstration purposes, the first two complexes are selected.


[3]

### workspace
project_path='/workspace/Uni-Mol'
# num of threads during preprocessing, the same as the num of CPUs.
nthreads = 12

### for posebusters
protein_path = f'{project_path}/eval_sets/posebusters/proteins'
ligand_path = f'{project_path}/eval_sets/posebusters/ligands'
lmdb_path = f'{project_path}/posebuster'
meta_info_file = f'{project_path}/eval_sets/posebusters/posebuster_set_meta.csv'
lmdb_name = 'posebuster_428'
num_ligand = 2 # choose the first two complexes to save time

### for astex
# protein_path = f'{project_path}/eval_sets/astex/proteins'
# ligand_path = f'{project_path}/eval_sets/astex/ligands'
# lmdb_path = f'{project_path}/astex'
# meta_info_file = f'{project_path}/eval_sets/astex/astex_set_meta.csv'
# lmdb_name = 'astex_85'
# num_ligand = 85

### generate lmdb
write_lmdb(protein_path, ligand_path, lmdb_path, meta_info_file, lmdb_name, num_ligand=num_ligand, nthreads=nthreads)

Example of meta_info content: 
  pdb_code lig_code                                  prot_url  \
0     5S8I      2LY  https://files.rcsb.org/download/5S8I.pdb   

                                             lig_url  \
0  http://ligand-expo.rcsb.org/files/2/2LY/isdf/5...   

                                            ligs  
0  <rdkit.Chem.rdchem.Mol object at 0x172fc16c0>  
pdb code: ['5S8I', '5SAK'] 
lig code: ['2LY', 'ZRY']
Start preprocessing data...
Number of systems: 2
2it [00:03,  1.58s/it]
Total num: 2, Success: 2, Failed: 0
Done!


Inference Using Public Model Weights

This script is the same as the one in the Uni-Mol README.

The model weights for protein-ligand binding pose prediction can also be obtained from the Uni-Mol repository.


[4]

data_path=lmdb_path
results_path=f'{project_path}/infer_pose' # replace with your results path
weight_path=f'{project_path}/ckp/binding_pose_220908.pt'
batch_size=8
dist_threshold=8.0
recycling=3
valid_subset=lmdb_name
mol_dict_name='dict_mol.txt'
pocket_dict_name='dict_pkt.txt'

!cp $project_path/example_data/molecule/dict.txt $data_path/$mol_dict_name
!cp $project_path/example_data/pocket/dict_coarse.txt $data_path/$pocket_dict_name
!python $project_path/unimol/infer.py --user-dir $project_path/unimol $data_path --valid-subset $valid_subset \
       --results-path $results_path \
       --num-workers 8 --ddp-backend=c10d --batch-size $batch_size \
       --task docking_pose --loss docking_pose --arch docking_pose \
       --path $weight_path \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
       --dist-threshold $dist_threshold --recycling $recycling \
       --log-interval 50 --log-format simple

2024-07-23 18:59:12 | INFO | unimol.inference | loading model(s) from /workspace/Uni-Mol/ckp/binding_pose_220908.pt
2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | ligand dictionary: 30 types
2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | pocket dictionary: 9 types
2024-07-23 18:59:16 | INFO | unimol.inference | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, all_gather_list_size=16384, allreduce_fp32_grad=False, arch='docking_pose', attention_dropout=0.1, batch_size=8, batch_size_valid=8, bf16=False, bf16_sr=False, broadcast_buffers=False, bucket_cap_mb=25, conf_size=10, cpu=False, curriculum=0, data='/workspace/Uni-Mol/posebuster', data_buffer_size=10, ddp_backend='c10d', delta_pair_repr_norm_loss=-1.0, device_id=0, disable_validation=False, dist_threshold=8.0, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, ema_decay=-1.0, emb_dropout=0.1, empty_cache_freq=0, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, fast_stat_sync=False, find_unused_parameters=False, finetune_mol_model=None, finetune_pocket_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=256, log_format='simple', log_interval=50, loss='docking_pose', lr_scheduler='fixed', lr_shrink=0.1, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_pocket_atoms=256, max_seq_len=512, max_valid_steps=None, min_loss_scale=0.0001, model_overrides='{}', mol=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), no_progress_bar=False, no_seed_provided=False, nprocs_per_node=1, num_workers=8, optimizer='adam', 
path='/workspace/Uni-Mol/ckp/binding_pose_220908.pt', pocket=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, profile=False, quiet=False, recycling=3, required_batch_size_multiple=1, results_path='/workspace/Uni-Mol/infer_pose', seed=1, skip_invalid_size_inputs_valid_test=False, suppress_crashes=False, task='docking_pose', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', user_dir='/workspace/Uni-Mol/unimol', valid_subset='posebuster_428', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, validate_with_ema=False, wandb_name='', wandb_project='', warmup_updates=0, weight_decay=0.0, x_norm_loss=-1.0)
2024-07-23 18:59:16 | INFO | unicore.tasks.unicore_task | get EpochBatchIterator for epoch 1
2024-07-23 18:59:20 | INFO | unimol.inference | Done inference!


Perform Docking Based on the Predicted Distance Matrix, Then Calculate the RMSD Metric:

The script is the same as the one in the Uni-Mol README.

The RMSD calculated here is the plain RMSD: atoms are matched in file order, and molecular symmetry is not taken into account.
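The difference between plain and symmetry-aware RMSD can be seen on a toy example. The sketch below is an illustration under stated assumptions, not the method used by the docking script or by RDKit: the symmetry-equivalent atoms are given by hand through the hypothetical `equivalent` argument, and all of their relabelings are enumerated by brute force. RDKit's `CalcRMS`, used in the next section, finds the atom mappings itself.

```python
import math
from itertools import permutations


def rmsd(coords_a, coords_b):
    """Plain RMSD between two coordinate lists matched atom by atom, no alignment."""
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def symmetric_rmsd(coords_a, coords_b, equivalent):
    """Smallest RMSD over all relabelings of the atom indices in `equivalent`."""
    best = float("inf")
    for perm in permutations(equivalent):
        mapping = list(range(len(coords_b)))
        for source, target in zip(equivalent, perm):
            mapping[source] = target
        reordered = [coords_b[i] for i in mapping]
        best = min(best, rmsd(coords_a, reordered))
    return best


# Toy molecule (assumed): a central atom with two interchangeable neighbours.
# The prediction places the same geometry but with atoms 1 and 2 swapped.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

plain = rmsd(reference, predicted)
symmetric = symmetric_rmsd(reference, predicted, equivalent=[1, 2])
print(f"plain RMSD: {plain:.3f}, symmetric RMSD: {symmetric:.3f}")
```

The plain value penalizes a pose that is physically identical to the reference, which is why symmetric ligands can look worse than they are under the plain metric.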


[5]

nthreads=nthreads
predict_file=f"{results_path}/ckp_{lmdb_name}.out.pkl" # path to your inference output file
reference_file=f"{lmdb_path}/{lmdb_name}.lmdb" # path to your reference lmdb file
output_path=f"{project_path}/{lmdb_name}_predict_sdf" # docking results path

%cd $project_path
!python $project_path/unimol/utils/docking.py --nthreads $nthreads --predict-file $predict_file --reference-file $reference_file --output-path $output_path

/workspace/Uni-Mol
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 12.41it/s]
  0%|                                                     | 0/2 [00:00<?, ?it/s]5SAK-N=C1N/C(=N\Nc2ccccc2)c2ccccc21-RMSD:0.6344-0.0361-0.2067
5S8I-CNC(=O)c1scc2c1OCCO2-RMSD:1.0975-0.0144-0.2149
100%|█████████████████████████████████████████████| 2/2 [00:23<00:00, 11.97s/it]
RMSD < 1.0 :  0.5
RMSD < 1.5 :  1.0
RMSD < 2.0 :  1.0
RMSD < 3.0 :  1.0
RMSD < 5.0 :  1.0
avg RMSD :  0.8659875802603305


Calculate Symmetric RMSD Metric


[6]

from rdkit.Chem.rdMolAlign import CalcRMS

def get_mol(sdf_path):
    supp = Chem.SDMolSupplier(sdf_path)
    mols = [mol for mol in supp if mol]
    if len(mols) == 0:
        # report the file that contains no readable molecule
        print(sdf_path)
    mol = mols[0]
    return mol

def get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file):
    df = pd.read_csv(meta_info_file)
    # only the first two complexes were processed in this demonstration
    pdb_ids = list(df['pdb_code'].values)[:2]
    lig_ids = list(df['lig_code'].values)[:2]
    print(f'calc rmsd for: \npdb code: {pdb_ids} \nlig code: {lig_ids}')
    sym_rmsd_results = []
    for pdbid, ligid in zip(pdb_ids, lig_ids):
        ref_sdf = os.path.join(reference_sdf_path, f'{pdbid}_{ligid}.sdf')
        prb_sdf = os.path.join(predicted_sdf_path, f'{pdbid}.ligand.sdf')
        ref_mol = get_mol(ref_sdf)
        prb_mol = get_mol(prb_sdf)
        # CalcRMS accounts for symmetry and does not align the probe molecule
        sym_rmsd = CalcRMS(
            Chem.RemoveHs(prb_mol),
            Chem.RemoveHs(ref_mol)
        )
        sym_rmsd_results.append(sym_rmsd)
    sym_rmsd_results = np.array(sym_rmsd_results)
    return sym_rmsd_results

def print_results(rmsd_results):
    print('*'*50)
    print(f'results length: {len(rmsd_results)}')
    print('RMSD < 1.0 : ', np.mean(rmsd_results<1.0))
    print('RMSD < 1.5 : ', np.mean(rmsd_results<1.5))
    print('RMSD < 2.0 : ', np.mean(rmsd_results<2.0))
    print('RMSD < 3.0 : ', np.mean(rmsd_results<3.0))
    print('RMSD < 5.0 : ', np.mean(rmsd_results<5.0))
    print('avg RMSD : ', np.mean(rmsd_results))


[7]

predicted_sdf_path = f'{output_path}/cache'
reference_sdf_path = ligand_path

### cal sym rmsd metrics
rmsd_results = get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file)
print_results(rmsd_results)

calc rmsd for: 
pdb code: ['5S8I', '5SAK'] 
lig code: ['2LY', 'ZRY']
**************************************************
results length: 2
RMSD < 1.0 :  0.5
RMSD < 1.5 :  1.0
RMSD < 2.0 :  1.0
RMSD < 3.0 :  1.0
RMSD < 5.0 :  1.0
avg RMSD :  0.8659898326762625


Prediction Structure Visualization


[8]

!pip install py3Dmol

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: py3Dmol in /opt/conda/lib/python3.8/site-packages (2.2.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


[9]

import py3Dmol

pdb_id = '5SAK'
lig_id = 'ZRY'
pdb_path = os.path.join(protein_path, f'{pdb_id}.pdb')
# use a separate name so the ligand_path directory defined earlier is not overwritten
pred_ligand_path = os.path.join(predicted_sdf_path, f'{pdb_id}.ligand.sdf')
gt_ligand_path = os.path.join(reference_sdf_path, f'{pdb_id}_{lig_id}.sdf')

# view 1: protein surface with the predicted ligand pose
view = py3Dmol.view()
view.removeAllModels()
view.addModel(open(pdb_path,'r').read(),format='pdb')
view.setStyle({'cartoon': {'arrows':True, 'tubes':False, 'style':'oval', 'color':'white'}})
view.addSurface(py3Dmol.VDW,{'opacity':0.5,'color':'white'})
view.addModel(open(pred_ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
view.zoomTo(viewer=(100,0))
view.show()

# view 2: predicted pose (green) overlaid on the crystal pose (red)
view.removeAllModels()
view.addModel(open(pred_ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
view.addModel(open(gt_ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'redCarbon','radius':0.2}})
view.zoomTo(viewer=(100,0))
view.show()


In the image, the green molecule is the structure predicted by Uni-Mol and the red molecule is the crystal structure.
