Bohrium
robot
新建

空间站广场

论文
Notebooks
比赛
课程
Apps
我的主页
我的Notebooks
我的论文库
我的足迹

我的工作空间

任务
节点
文件
数据集
镜像
项目
数据库
公开
Demo for Uni-Mol Docking on PoseBusters
Uni-Mol
docking
Uni-Moldocking
zhougm@dp.tech
发布于 2023-11-16
推荐镜像 :unimol-docking:pytorch1.12.1-cuda11.6
推荐机型 :c12_m46_1 * NVIDIA GPU B
赞 4
5
1
Uni-Mol achieved the best open source result on the PoseBusters docking benchmark!
Objective for this demo
Background
About Uni-Mol
About PoseBusters
Please note before running:
Environment
Code, data and model
Pipeline
Import modules
Data preprocessing func for generating lmdb files
Generate the model input lmdb file from protein pdb files and ligand sdf files
Infer with public ckp
Docking, then calculate RMSD metrics:
Calculate symmetric RMSD metrics

©️ Copyright 2023 @ Authors
Author: Gengmo Zhou 📨
Date:2023-11-16
Licenses:This Bohrium notebook uses the Uni-Mol model parameters, and its outputs are under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: http://creativecommons.org/licenses/by-nc-sa/4.0
Quick Start:Click the Connect button above,select unimol-docking:pytorch1.12.1-cuda11.6 image and GPU machine to start。

代码
文本

Uni-Mol achieved the best open source result on the PoseBusters docking benchmark!

alt img_v3_025c_0dbe5a36-1e6b-41f7-bc4a-2d60bd54282g.png

代码
文本

Objective for this demo

This demo aims to help readers reproduce the results of Uni-Mol on the PoseBusters benchmark from scratch.

代码
文本

Background

代码
文本

About Uni-Mol

Uni-Mol is a universal 3D molecular representation learning framework based on molecular structures released by DP Technology in May 2022. Uni-Mol contains two pretrained models with the same SE(3) Transformer architecture: a molecular model pretrained by 209M molecular conformations; a pocket model pretrained by 3M candidate protein pocket data.

Besides, the utilization of 3D structures and effective pretraining schemes enable Uni-Mol outperforms SOTA in 14/15 molecular property prediction tasks. Moreover, Uni-Mol achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. The paper has been accepted by the top machine learning conference, ICLR 2023.

代码
文本

About PoseBusters

PoseBusters is a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. Only methods that both pass these checks and predict native-like binding modes should be classed as having “state-of-the-art” performance.

The PoseBusters Benchmark set is a new set of carefully-selected publicly-available crystal complexes from the PDB. It is a diverse set of recent high-quality protein-ligand complexes which contain drug-like molecules. It only contains complexes released since 2021 and therefore does not contain any complexes present in the PDBbind General Set v2020 used to train Uni-Mol.

代码
文本

Please note before running:

Environment

  • Basic Docker image:
dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma
  • Other dependencies: RDKit and BioPandas:
rdkit==2022.9.3
biopandas==0.4.1
  • Data: protein pdf files and ligand sdf files of PoseBusters and Astex

Code, data and model

代码
文本

Pipeline

Import modules

代码
文本
[1]
import os
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, RDLogger
from rdkit.Chem import AllChem
from tqdm import tqdm
RDLogger.DisableLog('rdApp.*')
import warnings
warnings.filterwarnings(action='ignore')
from multiprocessing import Pool
import copy
import lmdb
from biopandas.pdb import PandasPdb
from sklearn.cluster import KMeans
from rdkit.Chem.rdMolAlign import AlignMolConformers
代码
文本

Data preprocessing func for generating lmdb files

Ligand preparation

The molecule is obtained from the sdf file and generate 100 conformations for it using RDKit. Then, these conformations are clustered into 10 by k-means and use them as the initial input for the model.

Protein preparation

The binding pockets residues are those within 6 Å of any crystal ligand heavy atom. Then ,the atoms are extracted from these residues and filter out metal and rare element atoms to obtain the pocket atoms for model input.

代码
文本
[2]
# allowed atom types
main_atoms = ['N', 'CA', 'C', 'O', 'H']
allow_pocket_atoms = ['C', 'H', 'N', 'O', 'S']

def cal_configs(coords):
"""Calculate pocket configs"""

centerx,centery,centerz = list((np.max(coords,axis=0)+np.min(coords,axis=0))/2)
sizex,sizey,sizez = list(np.max(coords,axis=0)-np.mean(coords,axis=0))
config = {'cx':centerx,'cy':centery,'cz':centerz,
'sx':sizex,'sy':sizey,'sz':sizez}
return config,centerx,centery,centerz,sizex,sizey,sizez


def filter_pocketatoms(atom):
if atom[:2] in ['Cd','Cs', 'Cn', 'Ce', 'Cm', 'Cf', 'Cl', 'Ca', \
'Cr', 'Co', 'Cu', 'Nh', 'Nd', 'Np', 'No', 'Ne', 'Na',\
'Ni','Nb', 'Os', 'Og', 'Hf', 'Hg', 'Hs', 'Ho', 'He',\
'Sr', 'Sn', 'Sb', 'Sg', 'Sm', 'Si', 'Sc', 'Se']:
return None
if atom[0] >= '0' and atom[0] <= '9':
return filter_pocketatoms(atom[1:])
if atom[0] in ['Z','M','P','D','F','K','I','B']:
return None
if atom[0] in allow_pocket_atoms:
return atom
return atom


def single_conf_gen(tgt_mol, num_confs=1000, seed=42, removeHs=True):
mol = copy.deepcopy(tgt_mol)
mol = Chem.AddHs(mol)
allconformers = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed, clearConfs=True)
sz = len(allconformers)
for i in range(sz):
try:
AllChem.MMFFOptimizeMolecule(mol, confId=i)
except:
continue
if removeHs:
mol = Chem.RemoveHs(mol)
return mol


def clustering_coords(mol, M=1000, N=100, seed=42, removeHs=True, method='bonds'):
rdkit_coords_list = []
if method == 'rdkit_MMFF':
rdkit_mol = single_conf_gen(mol, num_confs=M, seed=seed, removeHs=removeHs)
else:
print('no conformer generation methods:{}'.format(method))
raise
noHsIds = [rdkit_mol.GetAtoms()[i].GetIdx() for i in range(len(rdkit_mol.GetAtoms())) if rdkit_mol.GetAtoms()[i].GetAtomicNum()!=1]
### exclude hydrogens for aligning
AlignMolConformers(rdkit_mol, atomIds=noHsIds)
sz = len(rdkit_mol.GetConformers())
for i in range(sz):
_coords = rdkit_mol.GetConformers()[i].GetPositions().astype(np.float32)
rdkit_coords_list.append(_coords)

### exclude hydrogens for clustering
rdkit_coords_flatten = np.array(rdkit_coords_list)[:, noHsIds].reshape(sz,-1)
ids = KMeans(n_clusters=N, random_state=seed).fit_predict(rdkit_coords_flatten).tolist()
coords_list = [rdkit_coords_list[ids.index(i)] for i in range(N)]
return coords_list


def extract_pose_posebuster(content):

pdbid, ligid, protein_path, ligand_path, index = content

def read_pdb(path, pdbid):
#### protein preparation
pfile = os.path.join(path, pdbid+'.pdb')
pmol = PandasPdb().read_pdb(pfile)
return pmol

### totally posebuster data
def read_mol(path, pdbid, ligid):
lsdf = os.path.join(path, f'{pdbid}_{ligid}.sdf')
supp = Chem.SDMolSupplier(lsdf)
mols = [mol for mol in supp if mol]
if len(mols) == 0:
print(lsdf)
mol = mols[0]
return mol

# influence pocket size
dist_thres=6
if pdbid == 'index' or pdbid == 'readme':
return None

pmol = read_pdb(protein_path, pdbid)
pname = pdbid
mol = read_mol(ligand_path, pdbid, ligid)
mol = Chem.RemoveHs(mol)
lcoords = mol.GetConformer().GetPositions().astype(np.float32)
pdf = pmol.df['ATOM']
filter_std = []
for lcoord in lcoords:
pdf['dist'] = pmol.distance(xyz=list(lcoord), records=('ATOM'))
df = pdf[(pdf.dist <= dist_thres) & (pdf.element_symbol != 'H')][['chain_id', 'residue_number']]
filter_std += list(zip(df.chain_id.tolist(), df.residue_number.tolist()))

filter_std = set(filter_std)
patoms, pcoords, residues = [], np.empty((0,3)), []
for id,res in filter_std:
df = pdf[(pdf.chain_id == id) & (pdf.residue_number == res)]
patoms += df['atom_name'].tolist()
pcoords = np.concatenate((pcoords, df[['x_coord','y_coord','z_coord']].to_numpy()), axis=0)
residues += [str(id)+str(res)]*len(df)

if len(pcoords)==0:
print('empty pocket:', pdbid)
return None
config,centerx,centery,centerz,sizex,sizey,sizez = cal_configs(pcoords)

# filter unnormal atoms, include metal
atoms, index, residues_tmp = [], [], []
for i,a in enumerate(patoms):
output = filter_pocketatoms(a)
if output is not None:
index.append(True)
atoms.append(output)
residues_tmp.append(residues[i])
else:
index.append(False)
coordinates = pcoords[index].astype(np.float32)
residues = residues_tmp

assert len(atoms) == len(residues)
assert len(atoms) == coordinates.shape[0]

if len(atoms) != coordinates.shape[0]:
print(pname)
return None
patoms = atoms
pcoords = [coordinates]
side = [0 if a in main_atoms else 1 for a in patoms]

smiles = Chem.MolToSmiles(mol)
mol = AllChem.AddHs(mol, addCoords=True)
latoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
holo_coordinates = [mol.GetConformer().GetPositions().astype(np.float32)]
holo_mol = mol
M, N = 100, 10
coordinate_list = clustering_coords(mol, M=M, N=N, seed=42, removeHs=False, method='rdkit_MMFF')
mol_list = [mol]*N
ligand = [latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol]

return pname, patoms, pcoords, side, residues, config, ligand


def parser(content):
pname, patoms, pcoords, side, residues, config, ligand = extract_pose_posebuster(content)
latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol = ligand
pickle.dumps({})
return pickle.dumps(
{
"atoms": latoms,
"coordinates": coordinate_list,
"mol_list": mol_list,
"pocket_atoms": patoms,
"pocket_coordinates": pcoords,
"side": side,
"residue": residues,
"config": config,
"holo_coordinates": holo_coordinates,
"holo_mol": holo_mol,
"holo_pocket_coordinates": pcoords,
"smi": smiles,
'pocket':pname,
'scaffold':pname,
},
protocol=-1,
)


def write_lmdb(protein_path, ligand_path, outpath, meta_info_file, lmdb_name, nthreads=8):
os.makedirs(outpath, exist_ok=True)
df = pd.read_csv(meta_info_file)
pdb_ids = list(df['pdb_code'].values)
lig_ids = list(df['lig_code'].values)
content_list = list(zip(pdb_ids, lig_ids, [protein_path]*len(pdb_ids), [ligand_path]*len(pdb_ids), range(len(pdb_ids))))
outputfilename = os.path.join(outpath, lmdb_name +'.lmdb')
try:
os.remove(outputfilename)
except:
pass
env_new = lmdb.open(
outputfilename,
subdir=False,
readonly=False,
lock=False,
readahead=False,
meminit=False,
max_readers=1,
map_size=int(100e9),
)
txn_write = env_new.begin(write=True)
print("Start preprocessing data...")
print(f'Number of systems: {len(pdb_ids)}')
with Pool(nthreads) as pool:
i = 0
failed_num = 0
for inner_output in tqdm(pool.imap(parser, content_list)):
if inner_output is not None:
txn_write.put(f"{i}".encode("ascii"), inner_output)
i+=1
elif inner_output is None:
failed_num += 1
txn_write.commit()
env_new.close()
print(f'Total num: {len(pdb_ids)}, Success: {i}, Failed: {failed_num}')
print("Done!")
代码
文本

Generate the model input lmdb file from protein pdb files and ligand sdf files

Data Description for eval_sets

  • The PoseBusters data (428 entries) and Astex data (85 entries) are stored respectively in the posebusters and astex folders under eval_sets. The parse_protein.py script in the same directory is used to process the raw downloaded pdb and sdf files.

  • In the posebusters directory, the proteins and ligands folders respectively store the processed pdb and sdf files after running the parse_protein script. The naming format for pdb files is {pdb_code}.pdb, and for sdf files, it is {pdb_code}_{lig_code}.sdf. The posebuster_set_meta.csv file contains the pdb code, ligand code, and corresponding download urls for the entries in the PoseBusters benchmark. The raw data is downloaded from PDB via these urls.

  • Astex has the similar data directory structure as PoseBusters.

代码
文本
[ ]
### workspace
project_path='/workspace/Uni-Mol'

# num of threads during preprocessing, the same as the num of CPUs.
nthreads = 12

### for posebusters
protein_path = f'{project_path}/eval_sets/posebusters/proteins'
ligand_path = f'{project_path}/eval_sets/posebusters/ligands'
lmdb_path = f'{project_path}/posebuster'
meta_info_file = f'{project_path}/eval_sets/posebusters/posebuster_set_meta.csv'
lmdb_name = 'posebuster_428'

### for astex
# protein_path = f'{project_path}/eval_sets/astex/proteins'
# ligand_path = f'{project_path}/eval_sets/astex/ligands'
# lmdb_path = f'{project_path}/astex'
# meta_info_file = f'{project_path}/eval_sets/astex/astex_set_meta.csv'
# lmdb_name = 'astex_85'

### generate lmdb
write_lmdb(protein_path, ligand_path, lmdb_path, meta_info_file, lmdb_name, nthreads=nthreads)
代码
文本

Infer with public ckp

The script is the same as it is in the Uni-Mol Readme

The trained checkpoint for protein-ligand binding pose prediction is also obtained from the Uni-Mol repo

代码
文本
[ ]
data_path=lmdb_path
results_path=f'{project_path}/infer_pose' # replace to your results path
weight_path=f'{project_path}/ckp/binding_pose_220908.pt'
batch_size=8
dist_threshold=8.0
recycling=3
valid_subset=lmdb_name
mol_dict_name='dict_mol.txt'
pocket_dict_name='dict_pkt.txt'

!cp $project_path/example_data/molecule/dict.txt $data_path/$mol_dict_name
!cp $project_path/example_data/pocket/dict_coarse.txt $data_path/$pocket_dict_name
!python $project_path/unimol/infer.py --user-dir $project_path/unimol $data_path --valid-subset $valid_subset \
--results-path $results_path \
--num-workers 8 --ddp-backend=c10d --batch-size $batch_size \
--task docking_pose --loss docking_pose --arch docking_pose \
--path $weight_path \
--fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
--dist-threshold $dist_threshold --recycling $recycling \
--log-interval 50 --log-format simple
代码
文本

Docking, then calculate RMSD metrics:

The script is the same as it is in the Readme

The RMSD calculated here is pure RMSD, not considering symmetry.

代码
文本
[ ]
nthreads=nthreads
predict_file=f"{results_path}/ckp_{lmdb_name}.out.pkl" # Your inference file dir
reference_file=f"{lmdb_path}/{lmdb_name}.lmdb" # Your reference file dir
output_path=f"{project_path}/{lmdb_name}_predict_sdf" # Docking results path

%cd $project_path
!python $project_path/unimol/utils/docking.py --nthreads $nthreads --predict-file $predict_file --reference-file $reference_file --output-path $output_path
代码
文本

Calculate symmetric RMSD metrics

代码
文本
[ ]
from rdkit.Chem.rdMolAlign import CalcRMS

def get_mol(sdf_path):
supp = Chem.SDMolSupplier(sdf_path)
mols = [mol for mol in supp if mol]
if len(mols) == 0:
print(lsdf)
mol = mols[0]
return mol

def get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file):
df = pd.read_csv(meta_info_file)
pdb_ids = list(df['pdb_code'].values)
lig_ids = list(df['lig_code'].values)
sym_rmsd_results = []
for pdbid, ligid in zip(pdb_ids, lig_ids):
ref_sdf = os.path.join(reference_sdf_path, f'{pdbid}_{ligid}.sdf')
prb_sdf = os.path.join(predicted_sdf_path, f'{pdbid}.ligand.sdf')
ref_mol = get_mol(ref_sdf)
prb_mol = get_mol(prb_sdf)
sym_rmsd = CalcRMS(
Chem.RemoveHs(prb_mol),
Chem.RemoveHs(ref_mol)
)
sym_rmsd_results.append(sym_rmsd)
sym_rmsd_results = np.array(sym_rmsd_results)
return sym_rmsd_results

def print_results(rmsd_results):
print('*'*50)
print(f'results length: {len(rmsd_results)}')
print('RMSD < 1.0 : ', np.mean(rmsd_results<1.0))
print('RMSD < 1.5 : ', np.mean(rmsd_results<1.5))
print('RMSD < 2.0 : ', np.mean(rmsd_results<2.0))
print('RMSD < 3.0 : ', np.mean(rmsd_results<3.0))
print('RMSD < 5.0 : ', np.mean(rmsd_results<5.0))
print('avg RMSD : ', np.mean(rmsd_results))
代码
文本
[ ]
predicted_sdf_path = f'{output_path}/cache'
reference_sdf_path = ligand_path

### cal sym rmsd metrics
rmsd_results = get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file)
print_results(rmsd_results)
代码
文本
Uni-Mol
docking
Uni-Moldocking
已赞4
本文被以下合集收录
Uni-Mol最佳实践
Zhifeng Gao
更新于 2024-07-22
5 篇6 人关注
分子对接
18895326035@163.com
更新于 2024-05-21
6 篇0 人关注
推荐阅读
公开
Uni-ELF Ideas & App Heads-on Explained
Uni-ELFUni-MolPiloteyeEnglishBohrium Apps
Uni-ELFUni-MolPiloteyeEnglishBohrium Apps
chenx@dp.tech
更新于 2024-09-09
公开
code_test.ipynb
自动驾驶
自动驾驶
xuxh@dp.tech
更新于 2024-08-20
评论
 <div style="color:bl...

段辰儒

11-26 09:07
Hi, I do not see the image stated here. The only unimol related image I see is "unimol-qsar:v0.4".

段辰儒

11-26 10:36
No worries. unimol-qsar:v0.4 works fine
评论