Bohrium
robot
新建

空间站广场

论文
Notebooks
比赛
课程
Apps
我的主页
我的Notebooks
我的论文库
我的足迹

我的工作空间

任务
节点
文件
数据集
镜像
项目
数据库
公开
Uni-Mol Docking演示(以PoseBusters为例)
Uni-Mol
docking
Uni-Moldocking
zhougm@dp.tech
更新于 2024-07-23
推荐镜像 :unimol-docking:pytorch1.12.1-cuda11.6
推荐机型 :c3_m4_1 * NVIDIA T4
2
1
背景
关于Uni-Mol
关于PoseBusters
运行前准备:
环境
代码、数据和模型
运行
导入模块
数据预处理函数,用于生成 lmdb 文件
从蛋白质 pdb 文件和配体 sdf 文件生成模型输入的 lmdb 文件
使用公开的模型权重进行推理
根据预测的距离矩阵进行Docking,之后计算RMSD指标:
计算对称RMSD指标
预测结构可视化

©️ 版权所有 2023 @ 作者
作者: 周耕墨 📨
日期:2024-7-24
Licenses:这个Bohrium笔记本使用Uni-Mol模型参数,其输出内容遵循创作共用署名 4.0 国际 (CC BY 4.0) 许可协议。您可以在以下网址找到详细信息:http://creativecommons.org/licenses/by-nc-sa/4.0
快速开始:点击上方的 开始连接 按钮,选择unimol-docking:pytorch1.12.1-cuda11.6镜像和GPU机器开始使用,最便宜的就可以。

代码
文本

alt img_v3_025c_0dbe5a36-1e6b-41f7-bc4a-2d60bd54282g.png

代码
文本

背景

代码
文本

关于Uni-Mol

Uni-Mol是由深势科技在2022年5月发布的基于分子结构的通用三维分子表示学习框架。Uni-Mol包含两个预训练模型,它们都采用相同的SE(3) Transformer架构:一个是通过209M分子构象预训练的分子模型;另一个是通过300万候选蛋白质口袋数据预训练的口袋模型。

利用三维结构信息并结合有效的预训练方案使得Uni-Mol在14/15个分子属性预测任务中超越了之前最佳的方法。尤其是Uni-Mol在三维空间相关任务中表现出色,包括蛋白质-配体结合姿态预测、分子构象生成等。该论文已被顶级机器学习会议ICLR 2023接受。

代码
文本

关于PoseBusters

PoseBusters是一个Python包,它使用知名的化学信息学工具包RDKit执行一系列标准的质量检查。只有那些通过这些检查并预测出类似天然的结合模式的方法,才应被视为具有“最先进”的性能。

PoseBusters基准集是一组新的、精心挑选的、公开可用的来自PDB的晶体复合物。它是一组多样化的、最近的高质量蛋白质-配体复合物,包含类药分子。它只包含自2021年以来发布的复合物,因此不包含PDBbind通用集v2020中的任何复合物,后者被用来训练Uni-Mol。

代码
文本

运行前准备:

环境

  • 基础 Docker 镜像:
dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma
  • 其他依赖项:RDKit 和 BioPandas:
rdkit==2022.9.3
biopandas==0.4.1
  • 数据:PoseBustersAstex 的蛋白质 pdf 文件和配体 sdf 文件

代码、数据和模型

  • 代码链接:https://github.com/deepmodeling/Uni-Mol

  • Commit: b962451 (b962451a019e15363bd34b3af9d3a3cd02330947)

  • 项目路径:/workspace/Uni-Mol

  • 数据路径:/workspace/Uni-Mol/eval_sets

  • 模型路径:/workspace/Uni-Mol/ckp/binding_pose_220908.pt (可从github仓库下载)

代码
文本

运行

导入模块

代码
文本
[1]
import os
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, RDLogger
from rdkit.Chem import AllChem
from tqdm import tqdm
RDLogger.DisableLog('rdApp.*')
import warnings
warnings.filterwarnings(action='ignore')
from multiprocessing import Pool
import copy
import lmdb
from biopandas.pdb import PandasPdb
from sklearn.cluster import KMeans
from rdkit.Chem.rdMolAlign import AlignMolConformers
代码
文本

数据预处理函数,用于生成 lmdb 文件

配体准备

从 sdf 文件中获取分子,并使用 RDKit 为其生成 100 个构象。然后,通过 k-means 将这些构象聚类为 10 个,并将它们作为模型的初始输入。

蛋白质准备

蛋白质口袋残基是指与任何配体晶体结构的重原子相距 6 Å 以内的残基。然后,从这些残基中提取原子,并过滤掉金属和稀有元素原子,以获得模型输入的口袋原子。

代码
文本
[2]
# allowed atom types
main_atoms = ['N', 'CA', 'C', 'O', 'H']
allow_pocket_atoms = ['C', 'H', 'N', 'O', 'S']

def cal_configs(coords):
"""Calculate pocket configs"""

centerx,centery,centerz = list((np.max(coords,axis=0)+np.min(coords,axis=0))/2)
sizex,sizey,sizez = list(np.max(coords,axis=0)-np.mean(coords,axis=0))
config = {'cx':centerx,'cy':centery,'cz':centerz,
'sx':sizex,'sy':sizey,'sz':sizez}
return config,centerx,centery,centerz,sizex,sizey,sizez


def filter_pocketatoms(atom):
if atom[:2] in ['Cd','Cs', 'Cn', 'Ce', 'Cm', 'Cf', 'Cl', 'Ca', \
'Cr', 'Co', 'Cu', 'Nh', 'Nd', 'Np', 'No', 'Ne', 'Na',\
'Ni','Nb', 'Os', 'Og', 'Hf', 'Hg', 'Hs', 'Ho', 'He',\
'Sr', 'Sn', 'Sb', 'Sg', 'Sm', 'Si', 'Sc', 'Se']:
return None
if atom[0] >= '0' and atom[0] <= '9':
return filter_pocketatoms(atom[1:])
if atom[0] in ['Z','M','P','D','F','K','I','B']:
return None
if atom[0] in allow_pocket_atoms:
return atom
return atom


def single_conf_gen(tgt_mol, num_confs=1000, seed=42, removeHs=True):
mol = copy.deepcopy(tgt_mol)
mol = Chem.AddHs(mol)
allconformers = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed, clearConfs=True)
sz = len(allconformers)
for i in range(sz):
try:
AllChem.MMFFOptimizeMolecule(mol, confId=i)
except:
continue
if removeHs:
mol = Chem.RemoveHs(mol)
return mol


def clustering_coords(mol, M=1000, N=100, seed=42, removeHs=True, method='bonds'):
rdkit_coords_list = []
if method == 'rdkit_MMFF':
rdkit_mol = single_conf_gen(mol, num_confs=M, seed=seed, removeHs=removeHs)
else:
print('no conformer generation methods:{}'.format(method))
raise
noHsIds = [rdkit_mol.GetAtoms()[i].GetIdx() for i in range(len(rdkit_mol.GetAtoms())) if rdkit_mol.GetAtoms()[i].GetAtomicNum()!=1]
# exclude hydrogens for aligning
AlignMolConformers(rdkit_mol, atomIds=noHsIds)
sz = len(rdkit_mol.GetConformers())
for i in range(sz):
_coords = rdkit_mol.GetConformers()[i].GetPositions().astype(np.float32)
rdkit_coords_list.append(_coords)
# cluster confs, select the nearest conf to the center
# (num_confs, num_atoms, 3)
rdkit_coords = np.array(rdkit_coords_list)[:, noHsIds]
# (num_confa, num_atoms, 3) -> (num_confs, num_atoms*3)
rdkit_coords_flatten = rdkit_coords.reshape(sz, -1)
kmeans = KMeans(n_clusters=N, random_state=seed).fit(rdkit_coords_flatten)
# (num_clusters, num_atoms, 3)
center_coords = kmeans.cluster_centers_.reshape(N, -1, 3)
# (num_cluster, num_confs)
cdist = ((center_coords[:, None] - rdkit_coords[None, :])**2).sum(axis=(-1, -2))
# (num_confs,)
argmin = np.argmin(cdist, axis=-1)
coords_list = [rdkit_coords_list[i] for i in argmin]
return coords_list


def extract_pose_posebuster(content):

pdbid, ligid, protein_path, ligand_path, index = content

def read_pdb(path, pdbid):
#### protein preparation
pfile = os.path.join(path, pdbid+'.pdb')
pmol = PandasPdb().read_pdb(pfile)
return pmol

### totally posebuster data
def read_mol(path, pdbid, ligid):
lsdf = os.path.join(path, f'{pdbid}_{ligid}.sdf')
supp = Chem.SDMolSupplier(lsdf)
mols = [mol for mol in supp if mol]
if len(mols) == 0:
print(lsdf)
mol = mols[0]
return mol

# influence pocket size
dist_thres=6
if pdbid == 'index' or pdbid == 'readme':
return None

pmol = read_pdb(protein_path, pdbid)
pname = pdbid
mol = read_mol(ligand_path, pdbid, ligid)
mol = Chem.RemoveHs(mol)
lcoords = mol.GetConformer().GetPositions().astype(np.float32)
pdf = pmol.df['ATOM']
filter_std = []
for lcoord in lcoords:
pdf['dist'] = pmol.distance(xyz=list(lcoord), records=('ATOM'))
df = pdf[(pdf.dist <= dist_thres) & (pdf.element_symbol != 'H')][['chain_id', 'residue_number']]
filter_std += list(zip(df.chain_id.tolist(), df.residue_number.tolist()))

filter_std = set(filter_std)
patoms, pcoords, residues = [], np.empty((0,3)), []
for id,res in filter_std:
df = pdf[(pdf.chain_id == id) & (pdf.residue_number == res)]
patoms += df['atom_name'].tolist()
pcoords = np.concatenate((pcoords, df[['x_coord','y_coord','z_coord']].to_numpy()), axis=0)
residues += [str(id)+str(res)]*len(df)

if len(pcoords)==0:
print('empty pocket:', pdbid)
return None
config,centerx,centery,centerz,sizex,sizey,sizez = cal_configs(pcoords)

# filter unnormal atoms, include metal
atoms, index, residues_tmp = [], [], []
for i,a in enumerate(patoms):
output = filter_pocketatoms(a)
if output is not None:
index.append(True)
atoms.append(output)
residues_tmp.append(residues[i])
else:
index.append(False)
coordinates = pcoords[index].astype(np.float32)
residues = residues_tmp

assert len(atoms) == len(residues)
assert len(atoms) == coordinates.shape[0]

if len(atoms) != coordinates.shape[0]:
print(pname)
return None
patoms = atoms
pcoords = [coordinates]
side = [0 if a in main_atoms else 1 for a in patoms]

smiles = Chem.MolToSmiles(mol)
mol = AllChem.AddHs(mol, addCoords=True)
latoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
holo_coordinates = [mol.GetConformer().GetPositions().astype(np.float32)]
holo_mol = mol
M, N = 100, 10
coordinate_list = clustering_coords(mol, M=M, N=N, seed=42, removeHs=False, method='rdkit_MMFF')
mol_list = [mol]*N
ligand = [latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol]

return pname, patoms, pcoords, side, residues, config, ligand


def parser(content):
pname, patoms, pcoords, side, residues, config, ligand = extract_pose_posebuster(content)
latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol = ligand
pickle.dumps({})
return pickle.dumps(
{
"atoms": latoms,
"coordinates": coordinate_list,
"mol_list": mol_list,
"pocket_atoms": patoms,
"pocket_coordinates": pcoords,
"side": side,
"residue": residues,
"config": config,
"holo_coordinates": holo_coordinates,
"holo_mol": holo_mol,
"holo_pocket_coordinates": pcoords,
"smi": smiles,
'pocket':pname,
'scaffold':pname,
},
protocol=-1,
)


def write_lmdb(protein_path, ligand_path, outpath, meta_info_file, lmdb_name, num_ligand=428, nthreads=8):
os.makedirs(outpath, exist_ok=True)
df = pd.read_csv(meta_info_file)
print(f'Example of meta_info content: \n{df.head(1)}')
pdb_ids = list(df['pdb_code'].values)[:num_ligand]
lig_ids = list(df['lig_code'].values)[:num_ligand]
print(f'pdb code: {pdb_ids} \nlig code: {lig_ids}')
content_list = list(zip(pdb_ids, lig_ids, [protein_path]*len(pdb_ids), [ligand_path]*len(pdb_ids), range(len(pdb_ids))))
outputfilename = os.path.join(outpath, lmdb_name +'.lmdb')
try:
os.remove(outputfilename)
except:
pass
env_new = lmdb.open(
outputfilename,
subdir=False,
readonly=False,
lock=False,
readahead=False,
meminit=False,
max_readers=1,
map_size=int(100e9),
)
txn_write = env_new.begin(write=True)
print("Start preprocessing data...")
print(f'Number of systems: {len(pdb_ids)}')
with Pool(nthreads) as pool:
i = 0
failed_num = 0
for inner_output in tqdm(pool.imap(parser, content_list)):
if inner_output is not None:
txn_write.put(f"{i}".encode("ascii"), inner_output)
i+=1
elif inner_output is None:
failed_num += 1
txn_write.commit()
env_new.close()
print(f'\nTotal num: {len(pdb_ids)}, Success: {i}, Failed: {failed_num}')
print("Done!")
代码
文本

从蛋白质 pdb 文件和配体 sdf 文件生成模型输入的 lmdb 文件

数据描述 eval_sets

  • PoseBusters数据(428条)和Astex数据(85条)分别存储在 eval_sets 下的 posebustersastex 文件夹中。同一目录下的 parse_protein.py 脚本用于处理下载的原始pdb和sdf文件。

  • posebusters 目录中,蛋白质和配体文件夹分别存储运行 parse_protein 脚本后处理过的 pdbsdf 文件。pdb文件的命名格式为 {pdb_code}.pdb,sdf文件的命名格式为 {pdb_code}_{lig_code}.sdfposebuster_set_meta.csv 文件包含了PoseBusters基准测试中每条数据的pdb code、ligand code和相应的下载网址。原始数据通过这些网址从PDB下载。

  • Astex的数据目录结构与PoseBusters类似。

这里为了方便演示,选择前两个复合物

代码
文本
[3]
### workspace
project_path='/workspace/Uni-Mol'

# num of threads during preprocessing, the same as the num of CPUs.
nthreads = 12

### for posebusters
protein_path = f'{project_path}/eval_sets/posebusters/proteins'
ligand_path = f'{project_path}/eval_sets/posebusters/ligands'
lmdb_path = f'{project_path}/posebuster'
meta_info_file = f'{project_path}/eval_sets/posebusters/posebuster_set_meta.csv'
lmdb_name = 'posebuster_428'
num_ligand = 2 # choose the first two complexes to save time

### for astex
# protein_path = f'{project_path}/eval_sets/astex/proteins'
# ligand_path = f'{project_path}/eval_sets/astex/ligands'
# lmdb_path = f'{project_path}/astex'
# meta_info_file = f'{project_path}/eval_sets/astex/astex_set_meta.csv'
# lmdb_name = 'astex_85'
# num_ligand = 85

### generate lmdb
write_lmdb(protein_path, ligand_path, lmdb_path, meta_info_file, lmdb_name, num_ligand=num_ligand, nthreads=nthreads)
Example of meta_info content: 
  pdb_code lig_code                                  prot_url  \
0     5S8I      2LY  https://files.rcsb.org/download/5S8I.pdb   

                                             lig_url  \
0  http://ligand-expo.rcsb.org/files/2/2LY/isdf/5...   

                                            ligs  
0  <rdkit.Chem.rdchem.Mol object at 0x172fc16c0>  
pdb code: ['5S8I', '5SAK'] 
lig code: ['2LY', 'ZRY']
Start preprocessing data...
Number of systems: 2
2it [00:03,  1.58s/it]
Total num: 2, Success: 2, Failed: 0
Done!

代码
文本

使用公开的模型权重进行推理

该脚本与Uni-Mol Readme中的脚本相同

用于蛋白质-配体结合姿态预测的模型权重也可以从Uni-Mol仓库获得

代码
文本
[4]
data_path=lmdb_path
results_path=f'{project_path}/infer_pose' # replace to your results path
weight_path=f'{project_path}/ckp/binding_pose_220908.pt'
batch_size=8
dist_threshold=8.0
recycling=3
valid_subset=lmdb_name
mol_dict_name='dict_mol.txt'
pocket_dict_name='dict_pkt.txt'

!cp $project_path/example_data/molecule/dict.txt $data_path/$mol_dict_name
!cp $project_path/example_data/pocket/dict_coarse.txt $data_path/$pocket_dict_name
!python $project_path/unimol/infer.py --user-dir $project_path/unimol $data_path --valid-subset $valid_subset \
--results-path $results_path \
--num-workers 8 --ddp-backend=c10d --batch-size $batch_size \
--task docking_pose --loss docking_pose --arch docking_pose \
--path $weight_path \
--fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
--dist-threshold $dist_threshold --recycling $recycling \
--log-interval 50 --log-format simple
2024-07-23 18:59:12 | INFO | unimol.inference | loading model(s) from /workspace/Uni-Mol/ckp/binding_pose_220908.pt
2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | ligand dictionary: 30 types
2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | pocket dictionary: 9 types
2024-07-23 18:59:16 | INFO | unimol.inference | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, all_gather_list_size=16384, allreduce_fp32_grad=False, arch='docking_pose', attention_dropout=0.1, batch_size=8, batch_size_valid=8, bf16=False, bf16_sr=False, broadcast_buffers=False, bucket_cap_mb=25, conf_size=10, cpu=False, curriculum=0, data='/workspace/Uni-Mol/posebuster', data_buffer_size=10, ddp_backend='c10d', delta_pair_repr_norm_loss=-1.0, device_id=0, disable_validation=False, dist_threshold=8.0, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, ema_decay=-1.0, emb_dropout=0.1, empty_cache_freq=0, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, fast_stat_sync=False, find_unused_parameters=False, finetune_mol_model=None, finetune_pocket_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=256, log_format='simple', log_interval=50, loss='docking_pose', lr_scheduler='fixed', lr_shrink=0.1, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_pocket_atoms=256, max_seq_len=512, max_valid_steps=None, min_loss_scale=0.0001, model_overrides='{}', mol=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), no_progress_bar=False, no_seed_provided=False, nprocs_per_node=1, num_workers=8, optimizer='adam', path='/workspace/Uni-Mol/ckp/binding_pose_220908.pt', pocket=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, profile=False, quiet=False, recycling=3, required_batch_size_multiple=1, results_path='/workspace/Uni-Mol/infer_pose', seed=1, skip_invalid_size_inputs_valid_test=False, suppress_crashes=False, task='docking_pose', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', user_dir='/workspace/Uni-Mol/unimol', valid_subset='posebuster_428', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, validate_with_ema=False, wandb_name='', wandb_project='', warmup_updates=0, weight_decay=0.0, x_norm_loss=-1.0)
2024-07-23 18:59:16 | INFO | unicore.tasks.unicore_task | get EpochBatchIterator for epoch 1
2024-07-23 18:59:20 | INFO | unimol.inference | Done inference! 
代码
文本

根据预测的距离矩阵进行Docking,之后计算RMSD指标:

脚本与Uni-Mol Readme中的脚本相同

这里计算的RMSD是纯RMSD,不考虑对称性。

代码
文本
[5]
nthreads=nthreads
predict_file=f"{results_path}/ckp_{lmdb_name}.out.pkl" # Your inference file dir
reference_file=f"{lmdb_path}/{lmdb_name}.lmdb" # Your reference file dir
output_path=f"{project_path}/{lmdb_name}_predict_sdf" # Docking results path

%cd $project_path
!python $project_path/unimol/utils/docking.py --nthreads $nthreads --predict-file $predict_file --reference-file $reference_file --output-path $output_path
/workspace/Uni-Mol
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 12.41it/s]
  0%|                                                     | 0/2 [00:00<?, ?it/s]5SAK-N=C1N/C(=N\Nc2ccccc2)c2ccccc21-RMSD:0.6344-0.0361-0.2067
5S8I-CNC(=O)c1scc2c1OCCO2-RMSD:1.0975-0.0144-0.2149
100%|█████████████████████████████████████████████| 2/2 [00:23<00:00, 11.97s/it]
RMSD < 1.0 :  0.5
RMSD < 1.5 :  1.0
RMSD < 2.0 :  1.0
RMSD < 3.0 :  1.0
RMSD < 5.0 :  1.0
avg RMSD :  0.8659875802603305
代码
文本

计算对称RMSD指标

代码
文本
[6]
from rdkit.Chem.rdMolAlign import CalcRMS

def get_mol(sdf_path):
supp = Chem.SDMolSupplier(sdf_path)
mols = [mol for mol in supp if mol]
if len(mols) == 0:
print(lsdf)
mol = mols[0]
return mol

def get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file):
df = pd.read_csv(meta_info_file)
pdb_ids = list(df['pdb_code'].values)[:2]
lig_ids = list(df['lig_code'].values)[:2]
print(f'calc rmsd for: \npdb code: {pdb_ids} \nlig code: {lig_ids}')
sym_rmsd_results = []
for pdbid, ligid in zip(pdb_ids, lig_ids):
ref_sdf = os.path.join(reference_sdf_path, f'{pdbid}_{ligid}.sdf')
prb_sdf = os.path.join(predicted_sdf_path, f'{pdbid}.ligand.sdf')
ref_mol = get_mol(ref_sdf)
prb_mol = get_mol(prb_sdf)
sym_rmsd = CalcRMS(
Chem.RemoveHs(prb_mol),
Chem.RemoveHs(ref_mol)
)
sym_rmsd_results.append(sym_rmsd)
sym_rmsd_results = np.array(sym_rmsd_results)
return sym_rmsd_results

def print_results(rmsd_results):
print('*'*50)
print(f'results length: {len(rmsd_results)}')
print('RMSD < 1.0 : ', np.mean(rmsd_results<1.0))
print('RMSD < 1.5 : ', np.mean(rmsd_results<1.5))
print('RMSD < 2.0 : ', np.mean(rmsd_results<2.0))
print('RMSD < 3.0 : ', np.mean(rmsd_results<3.0))
print('RMSD < 5.0 : ', np.mean(rmsd_results<5.0))
print('avg RMSD : ', np.mean(rmsd_results))
代码
文本
[7]
predicted_sdf_path = f'{output_path}/cache'
reference_sdf_path = ligand_path

### cal sym rmsd metrics
rmsd_results = get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file)
print_results(rmsd_results)
calc rmsd for: 
pdb code: ['5S8I', '5SAK'] 
lig code: ['2LY', 'ZRY']
**************************************************
results length: 2
RMSD < 1.0 :  0.5
RMSD < 1.5 :  1.0
RMSD < 2.0 :  1.0
RMSD < 3.0 :  1.0
RMSD < 5.0 :  1.0
avg RMSD :  0.8659898326762625
代码
文本

预测结构可视化

代码
文本
[8]
!pip install py3Dmol
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: py3Dmol in /opt/conda/lib/python3.8/site-packages (2.2.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
代码
文本
双击即可修改
代码
文本
[9]
import py3Dmol
pdb_id = '5SAK'
lig_id = 'ZRY'
pdb_path = os.path.join(protein_path, f'{pdb_id}.pdb')
ligand_path = os.path.join(predicted_sdf_path, f'{pdb_id}.ligand.sdf')
gt_ligand_path = os.path.join(reference_sdf_path, f'{pdb_id}_{lig_id}.sdf')

view = py3Dmol.view()
view.removeAllModels()

view.addModel(open(pdb_path,'r').read(),format='pdb')
view.setStyle({'cartoon': {'arrows':True, 'tubes':False, 'style':'oval', 'color':'white'}})
view.addSurface(py3Dmol.VDW,{'opacity':0.5,'color':'white'})

view.addModel(open(ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})

view.zoomTo(viewer=(100,0))
view.show()

view.removeAllModels()


view.addModel(open(ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})


view.addModel(open(gt_ligand_path,'r').read(),format='sdf')
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'redCarbon','radius':0.2}})


view.zoomTo(viewer=(100,0))
view.show()
代码
文本

图中

绿色分子是unimol预测的结构

红色分子为晶体结构

代码
文本
Uni-Mol
docking
Uni-Moldocking
点个赞吧
推荐阅读
公开
深度学习模型在分子构象生成上表现更好吗?——利用RDKit和聚类算法获取高质量构象
Uni-Mol中文RDKit
Uni-Mol中文RDKit
zhougm@dp.tech
发布于 2023-11-12
1 赞
公开
手搓RHF其实不难只是慢(1):Gaussian型轨道和计算重叠积分
量子化学
量子化学
hanyanbo
发布于 2023-09-25
3 赞3 转存文件