Demo for Uni-Mol Docking on PoseBusters | Bohrium-玻尔科研空间站

空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的论文库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

Demo for Uni-Mol Docking on PoseBusters

Uni-Mol

docking

Uni-Moldocking

zhougm@dp.tech

发布于 2023-11-16

推荐镜像 :unimol-docking:pytorch1.12.1-cuda11.6

推荐机型 :c12_m46_1 * NVIDIA GPU B

Uni-Mol achieved the best open source result on the PoseBusters docking benchmark！

Objective for this demo

Background

About Uni-Mol

About PoseBusters

Please note before running:

Environment

Code, data and model

Pipeline

Import modules

Data preprocessing func for generating lmdb files

Generate the model input lmdb file from protein pdb files and ligand sdf files

Infer with public ckp

Docking, then calculate RMSD metrics:

Calculate symmetric RMSD metrics

©️ Copyright 2023 @ Authors
Author： Gengmo Zhou 📨
Date：2023-11-16
Licenses：This Bohrium notebook uses the Uni-Mol model parameters, and its outputs are under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: http://creativecommons.org/licenses/by-nc-sa/4.0
Quick Start：Click the Connect button above，select unimol-docking:pytorch1.12.1-cuda11.6 image and GPU machine to start。

代码

文本

Uni-Mol achieved the best open source result on the PoseBusters docking benchmark！

alt img_v3_025c_0dbe5a36-1e6b-41f7-bc4a-2d60bd54282g.png

代码

文本

Objective for this demo

This demo aims to help readers reproduce the results of Uni-Mol on the PoseBusters benchmark from scratch.

代码

文本

Background

代码

文本

About Uni-Mol

Code link: https://github.com/dptech-corp/Uni-Mol
Paper link: https://openreview.net/forum?id=6K2RM6wVqKu

Uni-Mol is a universal 3D molecular representation learning framework based on molecular structures released by DP Technology in May 2022. Uni-Mol contains two pretrained models with the same SE(3) Transformer architecture: a molecular model pretrained by 209M molecular conformations; a pocket model pretrained by 3M candidate protein pocket data.

Besides, the utilization of 3D structures and effective pretraining schemes enable Uni-Mol outperforms SOTA in 14/15 molecular property prediction tasks. Moreover, Uni-Mol achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. The paper has been accepted by the top machine learning conference, ICLR 2023.

代码

文本

About PoseBusters

Paper link: https://arxiv.org/pdf/2308.05777.pdf

PoseBusters is a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. Only methods that both pass these checks and predict native-like binding modes should be classed as having “state-of-the-art” performance.

The PoseBusters Benchmark set is a new set of carefully-selected publicly-available crystal complexes from the PDB. It is a diverse set of recent high-quality protein-ligand complexes which contain drug-like molecules. It only contains complexes released since 2021 and therefore does not contain any complexes present in the PDBbind General Set v2020 used to train Uni-Mol.

代码

文本

Please note before running:

Environment

Basic Docker image:

dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma

Other dependencies: RDKit and BioPandas:

rdkit==2022.9.3
biopandas==0.4.1

Data: protein pdf files and ligand sdf files of PoseBusters and Astex

Code, data and model

Code link: https://github.com/dptech-corp/Uni-Mol
Commit: b962451 (b962451a019e15363bd34b3af9d3a3cd02330947)
Project path: /workspace/Uni-Mol
Data path: /workspace/Uni-Mol/eval_sets
Model path: /workspace/Uni-Mol/ckp/binding_pose_220908.pt (from public repo)

代码

文本

Pipeline

Import modules

代码

文本

[1]

import os

import pickle

import numpy as np

import pandas as pd

from rdkit import Chem, RDLogger

from rdkit.Chem import AllChem

from tqdm import tqdm

RDLogger.DisableLog('rdApp.*')

import warnings

warnings.filterwarnings(action='ignore')

from multiprocessing import Pool

import copy

import lmdb

from biopandas.pdb import PandasPdb

from sklearn.cluster import KMeans

from rdkit.Chem.rdMolAlign import AlignMolConformers

代码

文本

Data preprocessing func for generating `lmdb` files

Ligand preparation

The molecule is obtained from the sdf file and generate 100 conformations for it using RDKit. Then, these conformations are clustered into 10 by k-means and use them as the initial input for the model.

Protein preparation

The binding pockets residues are those within 6 Å of any crystal ligand heavy atom. Then ,the atoms are extracted from these residues and filter out metal and rare element atoms to obtain the pocket atoms for model input.

代码

文本

[2]

999

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

›

⌄

# allowed atom types

main_atoms = ['N', 'CA', 'C', 'O', 'H']

allow_pocket_atoms = ['C', 'H', 'N', 'O', 'S']

def cal_configs(coords):

"""Calculate pocket configs"""

centerx,centery,centerz = list((np.max(coords,axis=0)+np.min(coords,axis=0))/2)

sizex,sizey,sizez = list(np.max(coords,axis=0)-np.mean(coords,axis=0))

config = {'cx':centerx,'cy':centery,'cz':centerz,

'sx':sizex,'sy':sizey,'sz':sizez}

return config,centerx,centery,centerz,sizex,sizey,sizez

def filter_pocketatoms(atom):

if atom[:2] in ['Cd','Cs', 'Cn', 'Ce', 'Cm', 'Cf', 'Cl', 'Ca', \

'Cr', 'Co', 'Cu', 'Nh', 'Nd', 'Np', 'No', 'Ne', 'Na',\

'Ni','Nb', 'Os', 'Og', 'Hf', 'Hg', 'Hs', 'Ho', 'He',\

'Sr', 'Sn', 'Sb', 'Sg', 'Sm', 'Si', 'Sc', 'Se']:

return None

if atom[0] >= '0' and atom[0] <= '9':

return filter_pocketatoms(atom[1:])

if atom[0] in ['Z','M','P','D','F','K','I','B']:

return None

if atom[0] in allow_pocket_atoms:

return atom

def single_conf_gen(tgt_mol, num_confs=1000, seed=42, removeHs=True):

mol = copy.deepcopy(tgt_mol)

mol = Chem.AddHs(mol)

allconformers = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed, clearConfs=True)

sz = len(allconformers)

for i in range(sz):

try:

AllChem.MMFFOptimizeMolecule(mol, confId=i)

except:

continue

if removeHs:

mol = Chem.RemoveHs(mol)

return mol

def clustering_coords(mol, M=1000, N=100, seed=42, removeHs=True, method='bonds'):

rdkit_coords_list = []

if method == 'rdkit_MMFF':

rdkit_mol = single_conf_gen(mol, num_confs=M, seed=seed, removeHs=removeHs)

else:

print('no conformer generation methods:{}'.format(method))

raise

noHsIds = [rdkit_mol.GetAtoms()[i].GetIdx() for i in range(len(rdkit_mol.GetAtoms())) if rdkit_mol.GetAtoms()[i].GetAtomicNum()!=1]

### exclude hydrogens for aligning

AlignMolConformers(rdkit_mol, atomIds=noHsIds)

sz = len(rdkit_mol.GetConformers())

for i in range(sz):

_coords = rdkit_mol.GetConformers()[i].GetPositions().astype(np.float32)

rdkit_coords_list.append(_coords)

### exclude hydrogens for clustering

rdkit_coords_flatten = np.array(rdkit_coords_list)[:, noHsIds].reshape(sz,-1)

ids = KMeans(n_clusters=N, random_state=seed).fit_predict(rdkit_coords_flatten).tolist()

coords_list = [rdkit_coords_list[ids.index(i)] for i in range(N)]

return coords_list

def extract_pose_posebuster(content):

pdbid, ligid, protein_path, ligand_path, index = content

def read_pdb(path, pdbid):

#### protein preparation

pfile = os.path.join(path, pdbid+'.pdb')

pmol = PandasPdb().read_pdb(pfile)

return pmol

### totally posebuster data

def read_mol(path, pdbid, ligid):

lsdf = os.path.join(path, f'{pdbid}_{ligid}.sdf')

supp = Chem.SDMolSupplier(lsdf)

mols = [mol for mol in supp if mol]

if len(mols) == 0:

print(lsdf)

mol = mols[0]

return mol

# influence pocket size

dist_thres=6

if pdbid == 'index' or pdbid == 'readme':

return None

pmol = read_pdb(protein_path, pdbid)

pname = pdbid

mol = read_mol(ligand_path, pdbid, ligid)

mol = Chem.RemoveHs(mol)

lcoords = mol.GetConformer().GetPositions().astype(np.float32)

pdf = pmol.df['ATOM']

filter_std = []

for lcoord in lcoords:

pdf['dist'] = pmol.distance(xyz=list(lcoord), records=('ATOM'))

df = pdf[(pdf.dist <= dist_thres) & (pdf.element_symbol != 'H')][['chain_id', 'residue_number']]

filter_std += list(zip(df.chain_id.tolist(), df.residue_number.tolist()))

filter_std = set(filter_std)

patoms, pcoords, residues = [], np.empty((0,3)), []

for id,res in filter_std:

df = pdf[(pdf.chain_id == id) & (pdf.residue_number == res)]

patoms += df['atom_name'].tolist()

pcoords = np.concatenate((pcoords, df[['x_coord','y_coord','z_coord']].to_numpy()), axis=0)

residues += [str(id)+str(res)]*len(df)

if len(pcoords)==0:

print('empty pocket:', pdbid)

return None

config,centerx,centery,centerz,sizex,sizey,sizez = cal_configs(pcoords)

# filter unnormal atoms, include metal

atoms, index, residues_tmp = [], [], []

for i,a in enumerate(patoms):

output = filter_pocketatoms(a)

if output is not None:

index.append(True)

atoms.append(output)

residues_tmp.append(residues[i])

else:

index.append(False)

coordinates = pcoords[index].astype(np.float32)

residues = residues_tmp

assert len(atoms) == len(residues)

assert len(atoms) == coordinates.shape[0]

if len(atoms) != coordinates.shape[0]:

print(pname)

return None

patoms = atoms

pcoords = [coordinates]

side = [0 if a in main_atoms else 1 for a in patoms]

smiles = Chem.MolToSmiles(mol)

mol = AllChem.AddHs(mol, addCoords=True)

latoms = [atom.GetSymbol() for atom in mol.GetAtoms()]

holo_coordinates = [mol.GetConformer().GetPositions().astype(np.float32)]

holo_mol = mol

M, N = 100, 10

coordinate_list = clustering_coords(mol, M=M, N=N, seed=42, removeHs=False, method='rdkit_MMFF')

mol_list = [mol]*N

ligand = [latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol]

return pname, patoms, pcoords, side, residues, config, ligand

def parser(content):

pname, patoms, pcoords, side, residues, config, ligand = extract_pose_posebuster(content)

latoms, coordinate_list, holo_coordinates, smiles, mol_list, holo_mol = ligand

pickle.dumps({})

return pickle.dumps(

{

"atoms": latoms,

"coordinates": coordinate_list,

"mol_list": mol_list,

"pocket_atoms": patoms,

"pocket_coordinates": pcoords,

"side": side,

"residue": residues,

"config": config,

"holo_coordinates": holo_coordinates,

"holo_mol": holo_mol,

"holo_pocket_coordinates": pcoords,

"smi": smiles,

'pocket':pname,

'scaffold':pname,

protocol=-1,

)

def write_lmdb(protein_path, ligand_path, outpath, meta_info_file, lmdb_name, nthreads=8):

os.makedirs(outpath, exist_ok=True)

df = pd.read_csv(meta_info_file)

pdb_ids = list(df['pdb_code'].values)

lig_ids = list(df['lig_code'].values)

content_list = list(zip(pdb_ids, lig_ids, [protein_path]*len(pdb_ids), [ligand_path]*len(pdb_ids), range(len(pdb_ids))))

outputfilename = os.path.join(outpath, lmdb_name +'.lmdb')

try:

os.remove(outputfilename)

except:

pass

env_new = lmdb.open(

outputfilename,

subdir=False,

readonly=False,

lock=False,

readahead=False,

meminit=False,

max_readers=1,

map_size=int(100e9),

)

txn_write = env_new.begin(write=True)

print("Start preprocessing data...")

print(f'Number of systems: {len(pdb_ids)}')

with Pool(nthreads) as pool:

i = 0

failed_num = 0

for inner_output in tqdm(pool.imap(parser, content_list)):

if inner_output is not None:

txn_write.put(f"{i}".encode("ascii"), inner_output)

i+=1

elif inner_output is None:

failed_num += 1

txn_write.commit()

env_new.close()

print(f'Total num: {len(pdb_ids)}, Success: {i}, Failed: {failed_num}')

print("Done!")

代码

文本

Generate the model input `lmdb` file from protein `pdb` files and ligand `sdf` files

Data Description for eval_sets

The PoseBusters data (428 entries) and Astex data (85 entries) are stored respectively in the posebusters and astex folders under eval_sets. The parse_protein.py script in the same directory is used to process the raw downloaded pdb and sdf files.
In the posebusters directory, the proteins and ligands folders respectively store the processed pdb and sdf files after running the parse_protein script. The naming format for pdb files is {pdb_code}.pdb, and for sdf files, it is {pdb_code}_{lig_code}.sdf. The posebuster_set_meta.csv file contains the pdb code, ligand code, and corresponding download urls for the entries in the PoseBusters benchmark. The raw data is downloaded from PDB via these urls.
Astex has the similar data directory structure as PoseBusters.

代码

文本

[ ]

### workspace

project_path='/workspace/Uni-Mol'

# num of threads during preprocessing, the same as the num of CPUs.

nthreads = 12

### for posebusters

protein_path = f'{project_path}/eval_sets/posebusters/proteins'

ligand_path = f'{project_path}/eval_sets/posebusters/ligands'

lmdb_path = f'{project_path}/posebuster'

meta_info_file = f'{project_path}/eval_sets/posebusters/posebuster_set_meta.csv'

lmdb_name = 'posebuster_428'

### for astex

# protein_path = f'{project_path}/eval_sets/astex/proteins'

# ligand_path = f'{project_path}/eval_sets/astex/ligands'

# lmdb_path = f'{project_path}/astex'

# meta_info_file = f'{project_path}/eval_sets/astex/astex_set_meta.csv'

# lmdb_name = 'astex_85'

### generate lmdb

write_lmdb(protein_path, ligand_path, lmdb_path, meta_info_file, lmdb_name, nthreads=nthreads)

代码

文本

Infer with public ckp

The script is the same as it is in the Uni-Mol Readme

The trained checkpoint for protein-ligand binding pose prediction is also obtained from the Uni-Mol repo

代码

文本

[ ]

data_path=lmdb_path

results_path=f'{project_path}/infer_pose' # replace to your results path

weight_path=f'{project_path}/ckp/binding_pose_220908.pt'

batch_size=8

dist_threshold=8.0

recycling=3

valid_subset=lmdb_name

mol_dict_name='dict_mol.txt'

pocket_dict_name='dict_pkt.txt'

!cp $project_path/example_data/molecule/dict.txt $data_path/$mol_dict_name

!cp $project_path/example_data/pocket/dict_coarse.txt $data_path/$pocket_dict_name

!python $project_path/unimol/infer.py --user-dir $project_path/unimol $data_path --valid-subset $valid_subset \

--results-path $results_path \

--num-workers 8 --ddp-backend=c10d --batch-size $batch_size \

--task docking_pose --loss docking_pose --arch docking_pose \

--path $weight_path \

--fp16 --fp16-init-scale 4 --fp16-scale-window 256 \

--dist-threshold $dist_threshold --recycling $recycling \

--log-interval 50 --log-format simple

代码

文本

Docking, then calculate RMSD metrics:

The script is the same as it is in the Readme

The RMSD calculated here is pure RMSD, not considering symmetry.

代码

文本

[ ]

nthreads=nthreads

predict_file=f"{results_path}/ckp_{lmdb_name}.out.pkl" # Your inference file dir

reference_file=f"{lmdb_path}/{lmdb_name}.lmdb" # Your reference file dir

output_path=f"{project_path}/{lmdb_name}_predict_sdf" # Docking results path

%cd $project_path

!python $project_path/unimol/utils/docking.py --nthreads $nthreads --predict-file $predict_file --reference-file $reference_file --output-path $output_path

代码

文本

Calculate symmetric RMSD metrics

代码

文本

[ ]

from rdkit.Chem.rdMolAlign import CalcRMS

def get_mol(sdf_path):

supp = Chem.SDMolSupplier(sdf_path)

mols = [mol for mol in supp if mol]

if len(mols) == 0:

print(lsdf)

mol = mols[0]

return mol

def get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file):

df = pd.read_csv(meta_info_file)

pdb_ids = list(df['pdb_code'].values)

lig_ids = list(df['lig_code'].values)

sym_rmsd_results = []

for pdbid, ligid in zip(pdb_ids, lig_ids):

ref_sdf = os.path.join(reference_sdf_path, f'{pdbid}_{ligid}.sdf')

prb_sdf = os.path.join(predicted_sdf_path, f'{pdbid}.ligand.sdf')

ref_mol = get_mol(ref_sdf)

prb_mol = get_mol(prb_sdf)

sym_rmsd = CalcRMS(

Chem.RemoveHs(prb_mol),

Chem.RemoveHs(ref_mol)

)

sym_rmsd_results.append(sym_rmsd)

sym_rmsd_results = np.array(sym_rmsd_results)

return sym_rmsd_results

def print_results(rmsd_results):

print('*'*50)

print(f'results length: {len(rmsd_results)}')

print('RMSD < 1.0 : ', np.mean(rmsd_results<1.0))

print('RMSD < 1.5 : ', np.mean(rmsd_results<1.5))

print('RMSD < 2.0 : ', np.mean(rmsd_results<2.0))

print('RMSD < 3.0 : ', np.mean(rmsd_results<3.0))

print('RMSD < 5.0 : ', np.mean(rmsd_results<5.0))

print('avg RMSD : ', np.mean(rmsd_results))

代码

文本

[ ]

predicted_sdf_path = f'{output_path}/cache'

reference_sdf_path = ligand_path

### cal sym rmsd metrics

rmsd_results = get_sym_rmsd(predicted_sdf_path, reference_sdf_path, meta_info_file)

print_results(rmsd_results)

代码

文本

Uni-Mol

docking

Uni-Moldocking

已赞4

本文被以下合集收录

Uni-Mol最佳实践

Zhifeng Gao

更新于 2024-07-22

5 篇6 人关注

分子对接

18895326035@163.com

更新于 2024-05-21

6 篇0 人关注