FFF: Automated Cryo-EM Structure Building Based on Protein Fragment Prediction
Machine type: c16_m62_1 Nvidia T4 (a higher-spec GPU can be selected)
Copyright 2023 @ Authors
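Quick start: click the Start Connection button above (the fff-notebook:v0.2.3 image is used by default) and the notebook will be ready to run after a short wait. If you run into any problems, please contact bohrium@dp.tech.
Note: running this notebook requires paid compute resources with a T4 GPU.
Cryo-electron microscopy (cryo-EM) is an advanced biological imaging technique that has in recent years become an important tool for determining molecular structures in biology. It is particularly well suited to large biomolecular complexes and membrane proteins. The biological sample is frozen so that the molecules stay in their native state at low temperature; a high-energy electron beam is then transmitted through the sample and the transmission images are collected; finally, computational image processing and 3D reconstruction yield a high-resolution 3D structure of the molecule. Because cryo-EM avoids the laborious crystallization step required by traditional X-ray crystallography, it has attracted wide attention.
2. Automated modeling: automated modeling software (such as Phenix, Rosetta, ARP/wARP, and MDFF) can generate an atomic model directly from a density map. This approach is efficient and consistent and reduces human bias. However, its accuracy can be limited when the density map has low resolution or the model is complex.
To address the shortcomings of existing approaches, we propose a new method, FFF ("Fragment-guided Flexible Fitting") [5], which builds more accurate and complete protein structures from cryo-EM experimental data. FFF combines protein structure prediction, protein structure recognition, and a flexible fitting algorithm to make cryo-EM structure modeling more reliable.
FFF use case demonstration
AlphaFold2 predicted structure
Below is the structure predicted by AlphaFold2 [3]. It adopts an inward-open conformation and differs substantially from our target structure.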
TM score (AlphaFold) TM-score = 0.5766 (d0= 7.54)
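To make the reported numbers easier to read, here is a minimal sketch of how a TM-score is evaluated for one fixed superposition. It uses the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8; the function names, the 0.5 Å lower bound on d0, and the assumption that the Cα pairs are already aligned and superimposed are illustrative choices. Real TM-score programs additionally search for the superposition that maximizes the score, which this sketch does not do.

```python
import math


def tm_d0(target_length: int) -> float:
    """Distance scale d0 in Å for a target of the given length.

    The 0.5 Å floor for very short targets is an assumption that mirrors
    common implementations.
    """
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_for_alignment(model_ca, target_ca, target_length):
    """TM-score of one given superposition (no search over superpositions).

    model_ca and target_ca are equal-length lists of (x, y, z) Cα
    coordinates in Å for the aligned residue pairs.
    """
    d0 = tm_d0(target_length)
    total = 0.0
    for a, b in zip(model_ca, target_ca):
        d = math.dist(a, b)
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / target_length


if __name__ == "__main__":
    # Example: a 442-residue target gives the d0 printed in this notebook.
    print(round(tm_d0(442), 2))
```

The actual target length is not stated in this notebook; 442 is only one length that is consistent with the printed d0 of 7.54.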
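MDFF (Molecular Dynamics Flexible Fitting) is a traditional method for cryo-EM structure building [4]. Let us try it on this case.
As the results below show, MDFF fails to build a 3D atomic model that fits the density map well. The main reason is that MDFF easily gets trapped in a local optimum when the initial structure is far from the target structure.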
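The local-optimum problem can be illustrated with a toy one-dimensional example. The sketch below uses the density-derived potential form from the MDFF paper [4], U = ξ [1 - (Φ - Φ_thr) / (Φ_max - Φ_thr)] where Φ ≥ Φ_thr and U = ξ elsewhere. The two-peak density, the step size, and the use of plain gradient descent in place of molecular dynamics are all assumptions made for illustration.

```python
import math


def density(x):
    """Toy 1D map: a weak peak near x = 2 and the strong, correct peak near x = 8."""
    weak = 0.4 * math.exp(-0.5 * (x - 2.0) ** 2)
    strong = 1.0 * math.exp(-0.5 * (x - 8.0) ** 2)
    return weak + strong


def mdff_potential(x, xi=1.0, phi_thr=0.0, phi_max=1.0):
    """Density-derived potential: low where the density is high."""
    phi = density(x)
    if phi < phi_thr:
        return xi
    return xi * (1.0 - (phi - phi_thr) / (phi_max - phi_thr))


def descend(x, step=0.05, iters=5000, h=1e-4):
    """Follow the negative gradient of the potential (a stand-in for low-temperature MD)."""
    for _ in range(iters):
        grad = (mdff_potential(x + h) - mdff_potential(x - h)) / (2.0 * h)
        x -= step * grad
    return x


if __name__ == "__main__":
    for start in (1.0, 6.0):
        end = descend(start)
        print(f"start {start:.1f} -> end {end:.2f}, U = {mdff_potential(end):.2f}")
```

An atom that starts near the weak peak settles there and never reaches the strong peak, because the gradient between the two peaks is essentially zero. A starting structure that is far from the target behaves the same way in a real map.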
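1. Density map recognition
We first convert the input density map to a standardized map with a voxel size of 1 Å, so that its voxel size matches that of the maps used to train the model. We also need to generate a variance map.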
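The voxel-size standardization can be sketched as a trilinear resampling onto a 1 Å grid. This is not the code the notebook runs: the function name, the nested-list map layout indexed [z][y][x], and the choice of trilinear interpolation are assumptions, and a practical implementation would use an array library and also carry over the map origin.

```python
import math


def resample_to_unit_voxel(grid, apix):
    """Trilinearly resample a map with cubic voxels of size apix (Å) to 1 Å voxels.

    grid is a nested list indexed [z][y][x]. Output sample k along an axis sits
    k Å from the first voxel centre, i.e. at input index k / apix.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # One output sample per Å over the extent spanned by the voxel centres.
    out_shape = [int(math.floor((n - 1) * apix)) + 1 for n in (nz, ny, nx)]

    def sample(z, y, x):
        z0, y0, x0 = int(z), int(y), int(x)
        z1, y1, x1 = min(z0 + 1, nz - 1), min(y0 + 1, ny - 1), min(x0 + 1, nx - 1)
        fz, fy, fx = z - z0, y - y0, x - x0
        value = 0.0
        # Weighted sum over the eight surrounding voxels.
        for zi, wz in ((z0, 1.0 - fz), (z1, fz)):
            for yi, wy in ((y0, 1.0 - fy), (y1, fy)):
                for xi, wx in ((x0, 1.0 - fx), (x1, fx)):
                    value += wz * wy * wx * grid[zi][yi][xi]
        return value

    return [[[sample(k / apix, j / apix, i / apix)
              for i in range(out_shape[2])]
             for j in range(out_shape[1])]
            for k in range(out_shape[0])]


if __name__ == "__main__":
    # A 3 x 3 x 3 map with 2 Å voxels whose value equals the x position in Å.
    coarse = [[[2.0 * x for x in range(3)] for _ in range(3)] for _ in range(3)]
    fine = resample_to_unit_voxel(coarse, apix=2.0)
    print(len(fine), len(fine[0]), len(fine[0][0]))
```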
['7BCQ.apix1.ccp4', '7BCQ.apix1_apix_map.mrc', '7BCQ.apix1_res_3.0.dx', '7BCQ.apix1_std_map.mrc', '7BCQ_clean.pdb', '7BCQ_clean_chain.pdb', '7BCQ_clean_clean.pdb', '7BCQ_clean_no_hetero.pdb', '7BCQ_cmd.dcd', '7BCQ_cmd.pdb', '7BCQ_cmd.rst', '7BCQ_cmd.tmscore.txt', '7BCQ_cmd_cmd_config.yml', '7BCQ_infer.cif', '7BCQ_infer.pdb', '7BCQ_infer.txt', '7BCQ_infer_backbone.mrc', '7BCQ_restr.exb', '7BCQ_restr_config.yml', '7BCQ_tmd.pdb', '7BCQ_tmd.tmscore.txt', '7BCQ_tmd_raw.pdb', '7BCQ_tmd_tmd.dcd', '7BCQ_tmd_tmd.rst', '7BCQ_tmd_tmd_config.yml', '7bcq_fff.dcd', '7bcq_fff.pdb']
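We can now run recognition on the density map and generate a number of protein fragments. Fragment recognition relies on several kinds of information: the probability, position, and amino acid type of each atom, as well as the pseudo peptide vectors.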
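As a rough illustration of how these quantities could be combined, the sketch below links predicted Cα atoms into fragments by following each atom's pseudo peptide vector to the nearest remaining atom. It is a hypothetical greedy procedure, not the FFF implementation: the data layout, the `tolerance` parameter, and the interpretation of the vector as pointing to the next Cα are assumptions. The defaults for `confidence` and `length_cutoff` mirror the `--confidence 0.3` and `--length-cutoff 2` flags visible in this notebook's `fff infer` log, but their exact semantics there are not documented here.

```python
import math


def trace_fragments(atoms, confidence=0.3, length_cutoff=2, tolerance=1.0):
    """Greedily chain predicted Cα atoms into fragments.

    atoms is a list of dicts with keys 'pos' (x, y, z in Å), 'prob' (0..1),
    'aa' (amino acid type) and 'ppv' (vector to the presumed next Cα).
    Returns fragments as lists of indices into atoms.
    """
    keep = [i for i, a in enumerate(atoms) if a["prob"] >= confidence]
    successor = {}
    taken = set()  # atoms that already have a predecessor
    for i in keep:
        predicted = tuple(p + v for p, v in zip(atoms[i]["pos"], atoms[i]["ppv"]))
        best, best_d = None, tolerance
        for j in keep:
            if j == i or j in taken:
                continue
            d = math.dist(predicted, atoms[j]["pos"])
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            successor[i] = best
            taken.add(best)
    fragments = []
    for start in (i for i in keep if i not in taken):
        frag = [start]
        while frag[-1] in successor:
            frag.append(successor[frag[-1]])
        fragments.append(frag)
    # Drop fragments shorter than the cutoff.
    return [f for f in fragments if len(f) >= length_cutoff]


if __name__ == "__main__":
    step = (3.8, 0.0, 0.0)
    demo = [{"pos": (3.8 * k, 0.0, 0.0), "prob": 0.9, "aa": "ALA", "ppv": step}
            for k in range(4)]
    print(trace_fragments(demo))
```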
/opt/conda/envs/dpemm/lib/python3.9/site-packages/torch/nn/functional.py:3704: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
(64, 64, 64) -> (64, 64, 64) 0 32
match domain from given fasta: /demo/fasta/7BCQ.fasta
num_residue: 236
num_domain: 18
domain mean length: 13.11111111111111
/data/fff_demo/output/7BCQ_infer.txt
/data/fff_demo/output/7BCQ_infer.cif
/data/fff_demo/output/7BCQ_infer.pdb
/data/fff_demo/output/7BCQ_infer_backbone.mrc
fff infer --output-txt /data/fff_demo/output/7BCQ_infer.txt --output-cif /data/fff_demo/output/7BCQ_infer.cif --output-pdb /data/fff_demo/output/7BCQ_infer.pdb --input-config /ckpt/train_config.json --input-weights /ckpt/fffw_304000.pt --input-raw-map /data/fff_demo/output/7BCQ.apix1_apix_map.mrc --input-std-map /data/fff_demo/output/7BCQ.apix1_std_map.mrc --input-fasta /demo/fasta/7BCQ.fasta --output-backbone-map /data/fff_demo/output/7BCQ_infer_backbone.mrc --confidence 0.3 --length-cutoff 2 --device 0
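Shown here is the backbone density map predicted by FFF (gray) against the input density map (light blue). The backbone density map gives, for each voxel, the probability that it belongs to a backbone atom (Cα, C, N).
2. Building the all-atom protein structure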
Finding missing atoms...
Adding missing atoms...
Writing output...
Done.
Load PDB... Done.
Re-organize chain id... Done.
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
Finding missing residues...
Finding nonstandard residues...
Replacing nonstandard residues...
Finding missing atoms...
Adding missing atoms...
Adding missing hydrogens...
Writing output...
Done.
Find GB force
Minimization...
Before: 437100.15625 kJ/mol -20948833.0407383 kJ/(nm mol)
After: -54571.84375 kJ/mol -1248.0090580619872 kJ/(nm mol)
Done.
dpems grid --input /data/fff_demo/output/7BCQ.apix1.ccp4 --output /data/fff_demo/output/7BCQ.apix1_res_3.0.dx --rinp 3 --rout 3.0
>> input is a ccp4 file: /data/fff_demo/output/7BCQ.apix1.ccp4
>> origin from /data/fff_demo/output/7BCQ.apix1.ccp4: [55.65999806 86.019997 72.86399746]
GRID SIZE: 65 x 65 x 65
>> output grid origin [55.65999806 86.019997 72.86399746]
dpems optstruc --configure /data/fff_demo/output/7BCQ_restr_config.yml
PLATFORM: CUDA
Writing SSrestraint
Writing CHIRALrestraint
Writing CISrestraint
restraint file: /data/fff_demo/output/7BCQ_restr.exb
dpems tmd --init-pdb /data/fff_demo/output/7BCQ_clean.pdb --restraint /data/fff_demo/output/7BCQ_restr.exb --coupling-config /data/fff_demo/output/7BCQ_tmd_tmd_config.yml --output-restart /data/fff_demo/output/7BCQ_tmd_tmd.rst --output-dcd /data/fff_demo/output/7BCQ_tmd_tmd.dcd --output-pdb /data/fff_demo/output/7BCQ_tmd_raw.pdb --output-pdb-aligned /data/fff_demo/output/7BCQ_tmd.pdb --temperature 10 --nsteps 12000 --traj-freq 1000 --report-freq 1000 --tmd-update-freq 1000 --platform CUDA @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. >>> 236 atoms selected for TMD >>> initial RMSD: 8.48 A Stage 1: gamma = 0.92 #"Step","Potential Energy (kJ/mole)","Temperature (K)","Density (g/mL)","Speed (ns/day)","Time Remaining" 1000,-55395.15953086443,10.935697480964484,9.625561292924427,0,-- @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 1] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 1] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 1] >>> rmsd: 7.80 A (gamma = 0.9166666666666666) Stage 2: gamma = 0.83 2000,-55288.67541526787,12.075340646822927,9.625561292924427,69.1,0:25 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. 
[stage 2] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 2] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 2] >>> rmsd: 7.13 A (gamma = 0.8333333333333334) Stage 3: gamma = 0.75 3000,-55083.09274333183,11.651242182053384,9.625561292924427,71.2,0:21 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 3] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 3] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 3] >>> rmsd: 6.46 A (gamma = 0.75) Stage 4: gamma = 0.67 4000,-54961.81337345173,12.08808337728884,9.625561292924427,72.6,0:19 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 4] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 4] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 4] >>> rmsd: 5.77 A (gamma = 0.6666666666666667) Stage 5: gamma = 0.58 5000,-54775.03003960011,12.387954444093356,9.625561292924427,72.7,0:16 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 5] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 5] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 5] >>> rmsd: 5.05 A (gamma = 0.5833333333333333) Stage 6: gamma = 0.50 6000,-54542.08228012238,11.673589617894702,9.625561292924427,73.3,0:14 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. 
[stage 6] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 6] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 6] >>> rmsd: 4.34 A (gamma = 0.5) Stage 7: gamma = 0.42 7000,-54390.491832383756,12.691338374008142,9.625561292924427,73.3,0:11 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 7] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 7] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 7] >>> rmsd: 3.62 A (gamma = 0.41666666666666663) Stage 8: gamma = 0.33 8000,-54243.49506762587,12.286542322856695,9.625561292924427,73.4,0:09 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 8] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 8] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 8] >>> rmsd: 2.92 A (gamma = 0.33333333333333337) Stage 9: gamma = 0.25 9000,-54145.08429337973,12.500200522709907,9.625561292924427,73.6,0:07 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 9] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 9] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 9] >>> rmsd: 2.22 A (gamma = 0.25) Stage 10: gamma = 0.17 10000,-53847.063096414015,12.377334237655072,9.625561292924427,73.9,0:04 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. 
[stage 10] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 10] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 10] >>> rmsd: 1.62 A (gamma = 0.16666666666666663) Stage 11: gamma = 0.08 11000,-53416.41181881514,12.860831477990114,9.625561292924427,73.9,0:02 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 11] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 11] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 11] >>> rmsd: 1.01 A (gamma = 0.08333333333333337) Stage 12: gamma = 0.00 12000,-52730.25720070057,14.061296279725285,9.625561292924427,73.9,0:00 @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. [stage 12] >>> save /data/fff_demo/output/7BCQ_tmd_raw.pdb [stage 12] >>> save /data/fff_demo/output/7BCQ_tmd.pdb [stage 12] >>> rmsd: 0.57 A (gamma = 0.0)
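The stage-by-stage gamma values in the TMD log above follow a simple linear schedule, which the sketch below reproduces from the `--nsteps 12000` and `--tmd-update-freq 1000` flags and the logged initial RMSD of 8.48 Å. The function name is hypothetical, and reading gamma × initial RMSD as the target RMSD of each stage is an interpretation rather than something the log states. The reported RMSDs (7.80, 7.13, ..., 0.57 Å) track that product closely but sit slightly above it.

```python
def tmd_schedule(nsteps=12000, update_freq=1000, initial_rmsd=8.48):
    """Return (stage, gamma, gamma * initial_rmsd) for each TMD stage.

    gamma decreases linearly from just below 1 to 0, one step per stage.
    """
    n_stages = nsteps // update_freq
    schedule = []
    for stage in range(1, n_stages + 1):
        gamma = 1.0 - stage / n_stages
        schedule.append((stage, gamma, gamma * initial_rmsd))
    return schedule


if __name__ == "__main__":
    for stage, gamma, target in tmd_schedule():
        print(f"stage {stage:2d}: gamma = {gamma:.2f}, gamma * initial RMSD = {target:.2f} Å")
```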
{'cmd_k': 10000.0, 'cmd_selection': 'name CA', 'cmd_steps_per_stage': 500, 'cmd_total_stages': 2, 'gpu_device': 0, 'input_pdb_init': '/data/fff_demo/output/7BCQ_tmd.pdb', 'input_pdb_target': '/data/fff_demo/output/7BCQ_infer.pdb', 'input_restr': '/data/fff_demo/output/7BCQ_restr.exb', 'output_dcd': '/data/fff_demo/output/7BCQ_cmd.dcd', 'output_pdb': '/data/fff_demo/output/7BCQ_cmd.pdb', 'output_rst': None, 'platform': 'CUDA', 'report_freq': 500, 'temperature': 10.0, 'total_steps': 5000, 'traj_freq': 500} dpems cmd --input-pdb /data/fff_demo/output/7BCQ_tmd.pdb --coupling-config /data/fff_demo/output/7BCQ_cmd_cmd_config.yml --output-restart /data/fff_demo/output/7BCQ_cmd.rst --output-dcd /data/fff_demo/output/7BCQ_cmd.dcd --output-pdb /data/fff_demo/output/7BCQ_cmd.pdb --temperature 10.0 --total-steps 5000 --traj-freq 500 --report-freq 500 --cmd-total-stages 2 --cmd-steps-per-stage 500 --platform CUDA --debug --restraint /data/fff_demo/output/7BCQ_restr.exb @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.07s. CREATE BIAS USING MAP: /data/fff_demo/output/7BCQ.apix1_res_3.0.dx INPUTMAP: 64 x 64 x 64 CREATEMAP: 64 x 64 x 64 >>> MDFF biases: [<openmm.openmm.CustomCompoundBondForce; proxy of <Swig Object of type 'OpenMM::CustomCompoundBondForce *' at 0x7f23b196ec90> >] >>> All biases: [<openmm.openmm.CustomExternalForce; proxy of <Swig Object of type 'OpenMM::CustomExternalForce *' at 0x7f23b1945630> >, <openmm.openmm.CustomCompoundBondForce; proxy of <Swig Object of type 'OpenMM::CustomCompoundBondForce *' at 0x7f23b196ec90> >] >>> Add restraints (SS, cis, chiral) using "/data/fff_demo/output/7BCQ_restr.exb" CMD Stage 1: gamma: 0.500 (236 atoms restrained) #"Step","Potential Energy (kJ/mole)","Temperature (K)","Density (g/mL)","Speed (ns/day)","Time Remaining" 500,-97090.2734375,11.759734960739115,9.625561292924427,0,-- @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. 
@> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. >>> RMSD: 0.72 Å >>> atom 82: xyz=[ 6.894629 11.00670433 8.39646816] nm; sys: xyz=[ 6.8741 11.0359 8.3503] nm; ref: xyz=[ 6.9458 10.9505 8.478 ] nm; CMD Stage 2: gamma: 1.000 (236 atoms restrained) 1000,-95746.671875,13.16433019533872,9.625561292924427,70.8,0:09 @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.06s. >>> RMSD: 0.65 Å >>> atom 82: xyz=[ 6.92476416 10.98426247 8.44348335] nm; sys: xyz=[ 6.8741 11.0359 8.3503] nm; ref: xyz=[ 6.9458 10.9505 8.478 ] nm; >>> Run MD with constraints (4000 steps to go; 236 atoms restrained) 1500,-96080.484375,12.115721037961649,9.625561292924427,70.2,0:08 2000,-96188.5859375,10.847192694100114,9.625561292924427,97.9,0:05 2500,-96217.4140625,10.104542208772175,9.625561292924427,122,0:03 3000,-96229.640625,10.122088472905846,9.625561292924427,143,0:02 3500,-96230.3203125,10.066886989533359,9.625561292924427,162,0:01 4000,-96215.984375,9.754862200097236,9.625561292924427,176,0:00 4500,-96229.625,9.944912422566823,9.625561292924427,191,0:00 5000,-96230.578125,9.86418352907686,9.625561292924427,204,0:00 @> 236 atoms and 1 coordinate set(s) were parsed in 0.00s. @> 6701 atoms and 1 coordinate set(s) were parsed in 0.07s. >>> RMSD: 0.65 Å >>> atom 82: xyz=[ 6.9252367 10.98064423 8.4386816 ] nm; sys: xyz=[ 6.8741 11.0359 8.3503] nm; ref: xyz=[ 6.9458 10.9505 8.478 ] nm; Done! >>> Total number of steps: 5000 >>> output pdb: /data/fff_demo/output/7BCQ_cmd.pdb Time elapsed: 12.634897708892822 s
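The per-atom debug lines in the CMD log can be turned into a quick progress check. The sketch below copies the coordinates printed for atom 82 and computes its distance to the `ref` position at each checkpoint. Reading `xyz` as the current position, `sys` as the starting position, and `ref` as the target position is an assumption based on the labels alone.

```python
import math

# Coordinates of atom 82 copied from the CMD log above, in nm.
REF = (6.9458, 10.9505, 8.478)
SYS = (6.8741, 11.0359, 8.3503)
XYZ_BY_CHECKPOINT = {
    "CMD stage 1 (gamma 0.5)": (6.894629, 11.00670433, 8.39646816),
    "CMD stage 2 (gamma 1.0)": (6.92476416, 10.98426247, 8.44348335),
    "end of run": (6.9252367, 10.98064423, 8.4386816),
}


def distance_angstrom(a, b):
    """Euclidean distance between two points given in nm, returned in Å."""
    return 10.0 * math.dist(a, b)


def distances_to_ref():
    """List (label, distance to ref in Å), starting with the sys position."""
    rows = [("sys", distance_angstrom(SYS, REF))]
    for label, xyz in XYZ_BY_CHECKPOINT.items():
        rows.append((label, distance_angstrom(xyz, REF)))
    return rows


if __name__ == "__main__":
    for label, d in distances_to_ref():
        print(f"{label:26s} {d:.2f} Å from ref")
```

For this atom the distance to `ref` falls from about 1.7 Å at the `sys` position to about 0.5 Å after the second CMD stage and then stays roughly constant for the rest of the run.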
3. Comparing the prediction with the published structure
Intermediate TM score (after TMD)
TM-score = 0.8989 (d0= 7.54)
-----------------
Final TM score (after CMD)
TM-score = 0.9096 (d0= 7.54)
-----------------
>>> output pdb: /data/fff_demo/output/7bcq_fff.pdb
>>> output dcd: /data/fff_demo/output/7bcq_fff.dcd
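Many methods exist for building all-atom models from cryo-EM data, but accurate, automated structure building for medium-resolution density maps remains a challenge. FFF combines 3D recognition algorithms from computer vision with molecular dynamics simulation techniques to automate protein structure building, and it is more accurate than both traditional methods and protein structure prediction methods. In this example, the TM-score rises from 0.5766 for the AlphaFold2 prediction to 0.9096 for the final FFF model. In the future, the FFF algorithm will be extended to building DNA, RNA, and small-molecule structures. In addition, we have developed an app based on the FFF algorithm (https://app.bohrium.dp.tech/fff) so that more people can apply FFF in their own cryo-EM data processing workflows. Detailed usage instructions for the FFF App are available in its documentation.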
[1] Garaeva, A.A., Guskov, A., Slotboom, D.J. et al. A one-gate elevator mechanism for the human neutral amino acid transporter ASCT2. Nat. Commun. 10, 3427 (2019).
[2] Garibsingh, R.A., Ndaru, E., Garaeva, A.A., Shi, Y., Zielewicz, L., Zakrepine, P., Bonomi, M., Slotboom, D.J., Paulino, C., Grewer, C., Schlessinger, A. Rational design of ASCT2 inhibitors using an integrated experimental-computational approach. Proc. Natl. Acad. Sci. U.S.A. 118, e2104093118 (2021).
[3] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
[4] Trabuco, L.G., Villa, E., Mitra, K., Frank, J., Schulten, K. Flexible fitting of atomic structures into electron microscopy maps using molecular dynamics. Structure 16, 673–683 (2008).
[5] Chen, W., Wang, X., Wang, Y. FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19776–19785 (2023).