©️ Copyright 2023 @ Authors
Author: Gengmo Zhou
Date: 2023-11-16
Licenses: This Bohrium notebook uses the Uni-Mol model parameters, and its outputs are under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: http://creativecommons.org/licenses/by-nc-sa/4.0
Quick Start: Click the Connect button above, then select the unimol-docking:pytorch1.12.1-cuda11.6 image and a GPU machine to start.
Uni-Mol achieved the best open source result on the PoseBusters docking benchmark!
Objective for this demo
This demo aims to help readers reproduce the results of Uni-Mol on the PoseBusters benchmark from scratch.
Background
About Uni-Mol
- Code link: https://github.com/dptech-corp/Uni-Mol
- Paper link: https://openreview.net/forum?id=6K2RM6wVqKu
Uni-Mol is a universal 3D molecular representation learning framework based on molecular structures, released by DP Technology in May 2022. Uni-Mol contains two pretrained models with the same SE(3) Transformer architecture: a molecular model pretrained on 209M molecular conformations, and a pocket model pretrained on 3M candidate protein pockets.
The use of 3D structures and effective pretraining schemes enables Uni-Mol to outperform SOTA methods in 14 of 15 molecular property prediction tasks. Moreover, Uni-Mol achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction and molecular conformation generation. The paper was accepted at ICLR 2023, a top machine learning conference.
About PoseBusters
- Paper link: https://arxiv.org/pdf/2308.05777.pdf
PoseBusters is a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. Only methods that both pass these checks and predict native-like binding modes should be classed as having “state-of-the-art” performance.
The PoseBusters Benchmark set is a new set of carefully-selected publicly-available crystal complexes from the PDB. It is a diverse set of recent high-quality protein-ligand complexes which contain drug-like molecules. It only contains complexes released since 2021 and therefore does not contain any complexes present in the PDBbind General Set v2020 used to train Uni-Mol.
Please note before running:
Environment
- Basic Docker image: dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma
- Other dependencies: RDKit and BioPandas (rdkit==2022.9.3, biopandas==0.4.1)
- Data: protein pdb files and ligand sdf files of PoseBusters and Astex
Code, data and model
Code link: https://github.com/dptech-corp/Uni-Mol
Commit: b962451 (b962451a019e15363bd34b3af9d3a3cd02330947)
Project path: /workspace/Uni-Mol
Data path: /workspace/Uni-Mol/eval_sets
Model path: /workspace/Uni-Mol/ckp/binding_pose_220908.pt (from public repo)
Pipeline
Import modules
Data preprocessing functions for generating lmdb files
Ligand preparation
The molecule is read from the sdf file, and 100 conformations are generated for it using RDKit. These conformations are then clustered into 10 clusters by k-means, which serve as the initial input for the model.
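The clustering step can be sketched as follows. This is a minimal, dependency-free illustration and not the code used in the Uni-Mol repository. It assumes the conformers have already been generated (for example with RDKit) and aligned to a common frame, and that each one is given as a list of (x, y, z) tuples. The function names `kmeans_centers` and `pick_representatives`, the farthest-point initialization, and the choice to return the conformer nearest to each centroid are all assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flatten(conformer):
    """Turn [(x, y, z), ...] into one flat coordinate vector."""
    return [c for atom in conformer for c in atom]


def kmeans_centers(points, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def pick_representatives(conformers, k=10):
    """Return sorted indices of the conformers closest to each k-means center."""
    if not conformers:
        raise ValueError("no conformers given")
    points = [flatten(c) for c in conformers]
    centers = kmeans_centers(points, min(k, len(points)))
    chosen = set()
    for center in centers:
        chosen.add(min(range(len(points)), key=lambda i: sq_dist(points[i], center)))
    return sorted(chosen)
```

Calling `pick_representatives(conformers, k=10)` on 100 aligned conformers would reduce them to at most 10 indices. In practice a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written loop.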
Protein preparation
The binding pocket residues are those within 6 Å of any heavy atom of the crystal ligand. The atoms of these residues are then extracted, and metal and rare-element atoms are filtered out to obtain the pocket atoms used as model input.
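The pocket selection described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's actual preprocessing code: the names `parse_pdb_atoms` and `extract_pocket` are hypothetical, the set of allowed elements is a guess at what "metal and rare element" filtering leaves behind, and atoms are read from fixed-width PDB columns rather than through BioPandas.

```python
import math

# Assumption: elements kept as pocket atoms. The text only states that
# metal and rare-element atoms are removed, not which elements remain.
ALLOWED_ELEMENTS = {"C", "N", "O", "S", "H"}


def parse_pdb_atoms(pdb_lines):
    """Parse ATOM/HETATM records using the fixed-width PDB column layout."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atoms.append({
            "name": line[12:16].strip(),
            # Residue key: (chain ID, residue number + insertion code, residue name).
            "residue": (line[21], line[22:27].strip(), line[17:20].strip()),
            "coord": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            # Atoms with an empty element column end up filtered out below.
            "element": line[76:78].strip().upper(),
        })
    return atoms


def extract_pocket(pdb_lines, ligand_heavy_coords, cutoff=6.0):
    """Keep atoms of residues that have any atom within `cutoff` Å of the ligand."""
    atoms = parse_pdb_atoms(pdb_lines)
    pocket_residues = {
        atom["residue"]
        for atom in atoms
        if any(math.dist(atom["coord"], lig) <= cutoff for lig in ligand_heavy_coords)
    }
    return [
        atom for atom in atoms
        if atom["residue"] in pocket_residues and atom["element"] in ALLOWED_ELEMENTS
    ]
```

Note that selection is done per residue: once any atom of a residue falls within the cutoff, all of its atoms are kept, including those farther than 6 Å from the ligand.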
Generate the model input lmdb file from protein pdb files and ligand sdf files
Data Description for eval_sets
The PoseBusters data (428 entries) and Astex data (85 entries) are stored in the posebusters and astex folders under eval_sets, respectively. The parse_protein.py script in the same directory is used to process the raw downloaded pdb and sdf files.
In the posebusters directory, the proteins and ligands folders store the processed pdb and sdf files, respectively, produced by running the parse_protein script. The naming format is {pdb_code}.pdb for pdb files and {pdb_code}_{lig_code}.sdf for sdf files. The posebuster_set_meta.csv file contains the pdb code, ligand code, and corresponding download urls for the entries in the PoseBusters benchmark. The raw data is downloaded from PDB via these urls.
Astex has a similar data directory structure to PoseBusters.
Infer with public ckp
The script is the same as the one in the Uni-Mol README.
The trained checkpoint for protein-ligand binding pose prediction is also obtained from the Uni-Mol repo.
Docking, then calculate RMSD metrics:
The script is the same as the one in the README.
The RMSD calculated here is the plain RMSD and does not account for symmetry.
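To make "not considering symmetry" concrete, here is a minimal sketch of an index-matched RMSD. It is not the evaluation script from the repository: the function name `rmsd` is chosen for this example, and it is assumed that predicted and reference coordinates list the same atoms in the same order and are compared without any re-alignment.

```python
import math


def rmsd(pred_coords, ref_coords):
    """Index-matched RMSD: atom i of the prediction is compared with atom i
    of the reference, with no symmetry correction and no superposition."""
    if not pred_coords or len(pred_coords) != len(ref_coords):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, r) ** 2 for p, r in zip(pred_coords, ref_coords))
    return math.sqrt(total / len(pred_coords))


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    swapped = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(rmsd(ref, ref))      # identical pose
    print(rmsd(swapped, ref))  # same geometry, atom labels exchanged
```

If the two atoms in the example were chemically equivalent, the swapped pose would be the same binding mode, yet the index-matched RMSD is non-zero. A symmetry-corrected RMSD would score it as a perfect match, so the plain value can only overestimate the error for symmetric ligands.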