©️ All rights reserved 2023 @ Author
Author: Gengmo Zhou
Date: 2024-7-24
Licenses: This Bohrium notebook uses Uni-Mol model parameters, and its output content follows the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find detailed information at: http://creativecommons.org/licenses/by-nc-sa/4.0
Quick Start: Click the Start Connection button above, then select the unimol-docking:pytorch1.12.1-cuda11.6 image and a GPU machine. The cheapest GPU option is sufficient.
Background
About Uni-Mol
Uni-Mol is a universal 3D molecular representation learning framework based on molecular structures, released by DeepModeling in May 2022. It includes two pre-trained models that share the same SE(3) Transformer architecture: a molecular model pre-trained on 209M molecular conformations, and a pocket model pre-trained on 3 million candidate protein pockets.
Combining 3D structural information with an effective pre-training scheme enables Uni-Mol to surpass the previous best methods in 14 out of 15 molecular property prediction tasks. Notably, Uni-Mol excels in tasks that depend on 3D space, including protein-ligand binding pose prediction and molecular conformation generation. The paper was accepted at ICLR 2023, a top machine learning conference.
About PoseBusters
PoseBusters is a Python package that performs a series of standard quality checks using the well-known cheminformatics toolkit RDKit. Only methods that pass these checks and predict binding modes close to the native (crystal) ones should be considered to have "state-of-the-art" performance.
The PoseBusters benchmark set is a new, carefully curated, publicly available set of crystal complexes from the PDB. It is a diverse, recent collection of high-quality protein-ligand complexes containing drug-like molecules. It only includes complexes released since 2021, thus excluding any complexes from the PDBbind general set v2020, which was used to train Uni-Mol.
Preparation Before Running
Environment
- Base Docker image: dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma
- Other dependencies: RDKit and BioPandas (rdkit==2022.9.3, biopandas==0.4.1)
- Data: Protein PDB files and ligand SDF files from PoseBusters and Astex
Code, Data, and Model
Code link: https://github.com/deepmodeling/Uni-Mol
Commit: b962451 (b962451a019e15363bd34b3af9d3a3cd02330947)
Project path: /workspace/Uni-Mol
Data path: /workspace/Uni-Mol/eval_sets
Model path: /workspace/Uni-Mol/ckp/binding_pose_220908.pt (can be downloaded from the GitHub repository)
Running
Import Modules
Data Preprocessing Function for Generating lmdb Files
Ligand Preparation
Extract molecules from SDF files and use RDKit to generate 100 conformations for each. Then, cluster these conformations into 10 groups using k-means, and use them as the initial input for the model.
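The clustering step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's own code: each conformation is assumed to be already flattened into one coordinate vector in a shared frame, the conformers themselves are replaced by random toy vectors (in practice they would come from RDKit conformer generation, which is not shown), and the helper name `kmeans_representatives` is hypothetical. It returns the member closest to each centroid, which is one possible way to pick a representative per cluster; the text does not state how representatives are taken.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    """Index of the centroid closest to vector v."""
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))

def kmeans_representatives(vectors, k, n_iter=50, seed=0):
    """Lloyd's k-means; returns the index of the member closest to each centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(n_iter):
        labels = [nearest(v, centroids) for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    labels = [nearest(v, centroids) for v in vectors]
    reps = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(vectors[i], centroids[c])))
    return sorted(reps)

# Toy stand-in for 100 conformations of a 3-atom molecule (3 atoms x 3 coordinates).
data_rng = random.Random(1)
conformers = [[data_rng.uniform(-1.0, 1.0) for _ in range(9)] for _ in range(100)]
reps = kmeans_representatives(conformers, k=10)
print(len(reps), "representative conformations selected")
```

A library implementation of k-means would normally be preferred in practice; the hand-written loop is used here only to keep the sketch free of dependencies.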
Protein Preparation
Protein pocket residues are defined as residues within 6 Å of any heavy atom of the ligand crystal structure. Atoms are then extracted from these residues, and metal and rare-element atoms are filtered out to obtain the pocket atoms used as model input.
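The pocket definition above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the notebook. The atom record layout (`residue_id`, `element`, `coord`), the function name `select_pocket_atoms`, and the `ALLOWED_ELEMENTS` set are all assumptions made for the example; the actual list of elements kept is not given in the text. A residue is treated as a pocket residue if any of its atoms lies within the cutoff of any ligand heavy atom.

```python
import math

# Assumption: elements kept as pocket atoms. Everything else (metals, rare
# elements) is dropped. The real element list is not stated in the notebook.
ALLOWED_ELEMENTS = {"C", "N", "O", "S", "H"}

def select_pocket_atoms(protein_atoms, ligand_heavy_coords, cutoff=6.0):
    """Return atoms of residues that have any atom within `cutoff` Å of the ligand."""
    pocket_residues = {
        atom["residue_id"]
        for atom in protein_atoms
        if any(math.dist(atom["coord"], lig) <= cutoff for lig in ligand_heavy_coords)
    }
    return [
        atom
        for atom in protein_atoms
        if atom["residue_id"] in pocket_residues and atom["element"] in ALLOWED_ELEMENTS
    ]

# Toy input: one ligand heavy atom and four protein atoms.
ligand = [(2.0, 0.0, 0.0)]
protein = [
    {"residue_id": ("A", 10), "element": "N", "coord": (0.0, 0.0, 0.0)},   # near ligand
    {"residue_id": ("A", 10), "element": "C", "coord": (9.0, 0.0, 0.0)},   # far, same residue
    {"residue_id": ("A", 50), "element": "C", "coord": (20.0, 0.0, 0.0)},  # far residue
    {"residue_id": ("A", 99), "element": "ZN", "coord": (1.0, 1.0, 0.0)},  # metal ion
]
pocket = select_pocket_atoms(protein, ligand)
print([(a["residue_id"], a["element"]) for a in pocket])
```

Note that the selection is per residue: the second atom is kept even though it is 7 Å from the ligand, because another atom of the same residue is within the cutoff.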
Generating lmdb Files for Model Input from Protein pdb Files and Ligand sdf Files
Data Description eval_sets
PoseBusters data (428 entries) and Astex data (85 entries) are stored in the posebusters and astex folders under eval_sets, respectively. The parse_protein.py script in the same directory is used to process the downloaded raw pdb and sdf files.
In the posebusters directory, the protein and ligand folders store the processed pdb and sdf files after running the parse_protein script. The naming format for pdb files is {pdb_code}.pdb, and for sdf files it is {pdb_code}_{lig_code}.sdf. The posebuster_set_meta.csv file contains the pdb code, ligand code, and corresponding download URLs for each entry in the PoseBusters benchmark. The raw data is downloaded from PDB via these URLs.
The data directory structure for Astex is similar to that of PoseBusters.
Here, for demonstration purposes, the first two complexes are selected.
Example of meta_info content:
  pdb_code lig_code                                  prot_url  \
0     5S8I      2LY  https://files.rcsb.org/download/5S8I.pdb
                                             lig_url  \
0  http://ligand-expo.rcsb.org/files/2/2LY/isdf/5...
                                            ligs
0  <rdkit.Chem.rdchem.Mol object at 0x172fc16c0>
pdb code: ['5S8I', '5SAK']
lig code: ['2LY', 'ZRY']
Start preprocessing data...
Number of systems: 2
2it [00:03, 1.58s/it]
Total num: 2, Success: 2, Failed: 0
Done!
Inference Using Public Model Weights
This script is the same as the one in the Uni-Mol Readme.
The model weights for protein-ligand binding pose prediction can also be obtained from the Uni-Mol repository.
2024-07-23 18:59:12 | INFO | unimol.inference | loading model(s) from /workspace/Uni-Mol/ckp/binding_pose_220908.pt 2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | ligand dictionary: 30 types 2024-07-23 18:59:12 | INFO | unimol.tasks.docking_pose | pocket dictionary: 9 types 2024-07-23 18:59:16 | INFO | unimol.inference | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, all_gather_list_size=16384, allreduce_fp32_grad=False, arch='docking_pose', attention_dropout=0.1, batch_size=8, batch_size_valid=8, bf16=False, bf16_sr=False, broadcast_buffers=False, bucket_cap_mb=25, conf_size=10, cpu=False, curriculum=0, data='/workspace/Uni-Mol/posebuster', data_buffer_size=10, ddp_backend='c10d', delta_pair_repr_norm_loss=-1.0, device_id=0, disable_validation=False, dist_threshold=8.0, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, ema_decay=-1.0, emb_dropout=0.1, empty_cache_freq=0, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, fast_stat_sync=False, find_unused_parameters=False, finetune_mol_model=None, finetune_pocket_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=256, log_format='simple', log_interval=50, loss='docking_pose', lr_scheduler='fixed', lr_shrink=0.1, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_pocket_atoms=256, max_seq_len=512, max_valid_steps=None, min_loss_scale=0.0001, model_overrides='{}', mol=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, 
masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), no_progress_bar=False, no_seed_provided=False, nprocs_per_node=1, num_workers=8, optimizer='adam', path='/workspace/Uni-Mol/ckp/binding_pose_220908.pt', pocket=Namespace(activation_dropout=0.0, activation_fn='gelu', attention_dropout=0.1, delta_pair_repr_norm_loss=-1.0, dropout=0.1, emb_dropout=0.1, encoder_attention_heads=64, encoder_embed_dim=512, encoder_ffn_embed_dim=2048, encoder_layers=15, masked_coord_loss=-1.0, masked_dist_loss=-1.0, masked_token_loss=-1.0, max_seq_len=512, pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, x_norm_loss=-1.0), pooler_activation_fn='tanh', pooler_dropout=0.0, post_ln=False, profile=False, quiet=False, recycling=3, required_batch_size_multiple=1, results_path='/workspace/Uni-Mol/infer_pose', seed=1, skip_invalid_size_inputs_valid_test=False, suppress_crashes=False, task='docking_pose', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', user_dir='/workspace/Uni-Mol/unimol', valid_subset='posebuster_428', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, validate_with_ema=False, wandb_name='', wandb_project='', warmup_updates=0, weight_decay=0.0, x_norm_loss=-1.0) 2024-07-23 18:59:16 | INFO | unicore.tasks.unicore_task | get EpochBatchIterator for epoch 1 2024-07-23 18:59:20 | INFO | unimol.inference | Done inference!
Perform Docking Based on the Predicted Distance Matrix, Then Calculate the RMSD Metric
The script is the same as the one in the Uni-Mol README.
The RMSD calculated here is the plain RMSD over atoms in their given order; it does not account for molecular symmetry.
/workspace/Uni-Mol
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 12.41it/s]
0%| | 0/2 [00:00<?, ?it/s]
5SAK-N=C1N/C(=N\Nc2ccccc2)c2ccccc21-RMSD:0.6344-0.0361-0.2067
5S8I-CNC(=O)c1scc2c1OCCO2-RMSD:1.0975-0.0144-0.2149
100%|█████████████████████████████████████████████| 2/2 [00:23<00:00, 11.97s/it]
RMSD < 1.0 : 0.5
RMSD < 1.5 : 1.0
RMSD < 2.0 : 1.0
RMSD < 3.0 : 1.0
RMSD < 5.0 : 1.0
avg RMSD : 0.8659875802603305
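The summary lines in this output are success rates at fixed RMSD thresholds plus the mean RMSD. The sketch below shows how such a summary can be computed, assuming atoms are compared in the same order and no alignment is applied. The function names `rmsd` and `summarize` are hypothetical, and the two input values are the rounded per-complex RMSDs printed above.

```python
import math

def rmsd(pred, ref):
    """Plain RMSD between two coordinate lists with identical atom order."""
    if len(pred) != len(ref):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(ref))

def summarize(rmsds, thresholds=(1.0, 1.5, 2.0, 3.0, 5.0)):
    """Fraction of complexes below each threshold, and the mean RMSD."""
    n = len(rmsds)
    rates = {t: sum(r < t for r in rmsds) / n for t in thresholds}
    return rates, sum(rmsds) / n

# Two atoms displaced by exactly 1 Å along z give an RMSD of 1 Å.
toy = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
           [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])

# Rounded per-complex values from the log above (5SAK and 5S8I).
rates, avg = summarize([0.6344, 1.0975])
for threshold, rate in rates.items():
    print(f"RMSD < {threshold} : {rate}")
print(f"avg RMSD : {avg}")
```

With only two complexes, each success rate can only be 0.0, 0.5, or 1.0, so these numbers illustrate the pipeline rather than benchmark performance.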
Calculate Symmetric RMSD Metric
calc rmsd for:
pdb code: ['5S8I', '5SAK']
lig code: ['2LY', 'ZRY']
**************************************************
results length: 2
RMSD < 1.0 : 0.5
RMSD < 1.5 : 1.0
RMSD < 2.0 : 1.0
RMSD < 3.0 : 1.0
RMSD < 5.0 : 1.0
avg RMSD : 0.8659898326762625
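A symmetry-aware RMSD takes the minimum RMSD over all atom mappings that leave the molecular graph unchanged, so that swapping chemically equivalent atoms is not penalized. The sketch below illustrates the idea with the mappings supplied by hand; the function name `symmetric_rmsd` and the toy coordinates are assumptions for illustration, and this is not the script used in the notebook. In practice the equivalent mappings would be enumerated by a cheminformatics toolkit such as RDKit rather than written out.

```python
import math

def symmetric_rmsd(pred, ref, mappings):
    """Minimum in-place RMSD over a list of symmetry-equivalent atom mappings.

    Each mapping is a sequence where mapping[i] is the index of the predicted
    atom matched to reference atom i.
    """
    best = math.inf
    for mapping in mappings:
        squared = sum(math.dist(pred[j], ref[i]) ** 2 for i, j in enumerate(mapping))
        best = min(best, math.sqrt(squared / len(ref)))
    return best

# Toy molecule: atom 0 in the centre, atoms 1 and 2 chemically equivalent.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
# Same pose, but the two equivalent atoms are listed in swapped order.
pred = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

plain = symmetric_rmsd(pred, ref, [(0, 1, 2)])            # identity mapping only
sym = symmetric_rmsd(pred, ref, [(0, 1, 2), (0, 2, 1)])   # identity and swap
print(f"plain RMSD: {plain:.3f}, symmetric RMSD: {sym:.3f}")
```

For the two complexes in this run, the symmetric and plain averages agree to about five decimal places (0.86599 in both outputs), so symmetry correction has no practical effect on these ligands.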
Prediction Structure Visualization
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: py3Dmol in /opt/conda/lib/python3.8/site-packages (2.2.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
In the image, the green molecule is the structure predicted by Uni-Mol and the red molecule is the crystal structure.