DPA-2 Hands-On Tutorial
©️ Copyright 2023 @ Authors
Author: 张成谦 📨
Date: 2023-11-30
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the blue 开始连接 (Connect) button at the top of the interface, select the `dpa2-tutorial:v3` image and the `c4_m15_1 * NVIDIA T4` node configuration, and the notebook will be ready to run after a short wait.
💭 Before reading this notebook, we recommend working through the DeePMD and DPA-1 tutorials first; the meaning of each parameter in the input files will not be explained in detail here.
1. Learning Objectives
After completing this tutorial, you will have learned:
- the basic principles and applications of DPA-2;
- hands-on DPA-2 potential training on the H2O-SCAN0 dataset: reading the input script, and training from scratch vs. fine-tuning an existing pretrained model.
2. Introduction to DPA-2
👂 Eager to get hands-on? Jump straight to Section 3~
2.1 Background
The training of interatomic potentials has always sought a balance between accuracy and efficiency. Classical force fields are fast and convenient, but their accuracy is hard to improve further; ab initio molecular dynamics (AIMD), popular in recent years, delivers much higher accuracy, but its computational cost is prohibitive for large systems and long time scales. With the development of AI for Science, machine learning has made it possible to train potentials that are both accurate and efficient (figure below: comparison of molecular dynamics approaches). In the new MLMD paradigm, quantum-mechanical (QM) calculations no longer drive AIMD directly; instead, they are used to prepare the dataset for a machine-learned potential (MLP). Of course, AIMD results can also serve as the initial dataset.
However, because existing models transfer poorly and no general-purpose large model is available, scientists facing a new, complex system still generally have to generate large amounts of reference data and train a model from scratch before obtaining a usable, reasonably complete potential. As electronic-structure data accumulate, and by analogy with the evolution of other AI fields such as computer vision (CV) and natural language processing (NLP), **"pretraining + fine-tuning on a small amount of data"** is the natural way to tackle this problem.
To realize this paradigm, we urgently need a model architecture with strong transferability that can accommodate most elements of the periodic table.
2.2 Methods
DPA-2 is another comprehensive upgrade of the DP model family, following DPA-1:
First, DPA-2 adopts a multi-task training strategy: it can be pretrained simultaneously on multiple datasets generated with different DFT settings. When fine-tuning for a downstream task, the model backbone (the part that encodes representations of configurational and chemical space) is retained and connected to one or more head networks, so the labeling schemes of the pretraining and fine-tuning datasets need not be identical. A fine-tuned model has many parameters, which can make it slow when applied directly in production (e.g. MD simulation); to address this, the model can be distilled into one with far fewer parameters, preserving the downstream accuracy while gaining the speed needed for large-scale simulation. (Figure 1: workflow of pretraining, fine-tuning, and distillation)
Second, DPA-2 further improves the model architecture to capture interatomic interactions more completely. Pretrained on datasets covering alloys, semiconductors, battery materials, and drug-like molecules, it learns richer latent information about atomic interactions, greatly improving transferability across datasets with different conformations and compositions. We pretrained the model on 18 different datasets and transferred it to a variety of downstream tasks; the experiments show that, compared with DPA-1, the pretrained DPA-2 model further reduces the amount of data and training cost needed for downstream tasks while improving prediction accuracy. (Figure 2: schematic of the DPA-2 model architecture)
2.3 Experimental Validation
Dataset overview
We visualized the representation (descriptor) of each dataset in the pretrained model using t-SNE; the result is shown in the figure below:
Fine-tuning on downstream tasks
We tested the sample efficiency of DPA-1, DPA-2, and the multitask-pretrained DPA-2 on downstream datasets, observing how the converged energy and force RMSE change as the amount of downstream data increases (Figure 3: results on the downstream datasets).
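For readers who want to reproduce this kind of plot, here is a minimal sketch of a t-SNE projection, assuming the per-frame descriptors have already been extracted into a NumPy array (the `tsne_embed` helper and the array shapes are inventions of this example, not part of the tutorial files):

```python
# Sketch: project per-frame descriptor vectors to 2-D with t-SNE for plotting.
# The descriptor array here is synthetic; in practice it would come from the
# pretrained model's evaluation of each dataset.
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(descriptors: np.ndarray, seed: int = 0) -> np.ndarray:
    """Map an (n_frames, d) descriptor matrix to an (n_frames, 2) embedding."""
    tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=seed)
    return tsne.fit_transform(descriptors)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical datasets with distinct descriptor statistics
    desc = np.vstack([rng.normal(0.0, 1.0, (30, 64)),
                      rng.normal(5.0, 1.0, (30, 64))])
    emb = tsne_embed(desc)
    print(emb.shape)  # (60, 2)
```

Coloring the resulting 2-D points by dataset label gives the kind of cluster map shown in the figure.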
From the figure we can see that:
1. DPA-2 trained from scratch converges to better accuracy than DPA-1, especially when the data volume is large enough, highlighting the advantage of the DPA-2 architecture.
2. With multi-task pretraining, the fine-tuned DPA-2 model can yield a much lower error curve than DPA-2 trained from scratch, especially when downstream data are limited. On datasets such as H2O-SCAN0, even the zero-shot RMSE is already sufficiently accurate.
3. DPA-2 in Practice
Now that we have covered the theory, let's get hands-on! In this section, we will use the H2O-SCAN0 dataset to train a DPA-2 potential from scratch and to fine-tune one from a pretrained model.
Note: the dataset used in this tutorial comes from AIS-Square; if you need more models and data, go explore it~
3.1 Contents
/root/dpa2
.
├── data
│   └── H2O-scan0
├── finetune
│   └── input.json
├── from_scratch
│   └── input.json
└── pretrain_model
    └── model.pt

5 directories, 3 files
Let's take a look at what the tutorial directory contains:
data: the H2O-SCAN0 dataset
finetune: directory for fine-tuning from the pretrained model; input.json is the input file
from_scratch: directory for training DPA-2 from scratch; input.json is the input file
pretrain_model: directory holding the pretrained model, model.pt
3.2 Preparing the Input Script
In this example, the input files for training from scratch and for fine-tuning are identical:
{
  "_comment": "that's all",
  "model": {
    "type_embedding": { "neuron": [ 8 ], "tebd_input_mode": "concat" },
    "type_map": [ "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                  "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca",
                  "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
                  "Ga", "Ge", "As", "Se", "Br", "Kr", "Rb", "Sr", "Y", "Zr",
                  "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd", "In", "Sn",
                  "Sb", "Te", "I", "Xe", "Cs", "Ba", "La", "Ce", "Pr", "Nd",
                  "Pm", "Sm", "Eu", "Gd", "Tb", "Dy", "Ho", "Er", "Tm", "Yb",
                  "Lu", "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
                  "Tl", "Pb", "Bi", "Po", "At", "Rn", "Fr", "Ra", "Ac", "Th",
                  "Pa", "U", "Np", "Pu", "Am", "Cm", "Bk", "Cf", "Es", "Fm",
                  "Md", "No", "Lr", "Rf", "Db", "Sg", "Bh", "Hs", "Mt", "Ds",
                  "Rg", "Cn", "Nh", "Fl", "Mc", "Lv", "Ts", "Og" ],
    "descriptor": {
      "type": "hybrid",
      "hybrid_mode": "sequential",
      "list": [
        {
          "type": "se_atten", "sel": 120, "rcut_smth": 8.0, "rcut": 9.0,
          "neuron": [ 25, 50, 100 ], "resnet_dt": false, "axis_neuron": 12,
          "seed": 1, "attn": 128, "attn_layer": 0, "attn_dotr": true,
          "attn_mask": false, "post_ln": true, "ffn": false,
          "ffn_embed_dim": 1024, "activation": "tanh", "scaling_factor": 1.0,
          "head_num": 1, "normalize": true, "temperature": 1.0,
          "add": "concat", "pre_add": true, "_comment": " that's all"
        },
        {
          "type": "se_uni", "sel": 40, "rcut_smth": 3.5, "rcut": 4.0,
          "nlayers": 12, "g1_dim": 128, "g2_dim": 32, "attn2_hidden": 32,
          "attn2_nhead": 4, "attn1_hidden": 128, "attn1_nhead": 4,
          "axis_dim": 4, "update_h2": false, "update_g1_has_conv": true,
          "update_g1_has_grrg": true, "update_g1_has_drrd": true,
          "update_g1_has_attn": true, "update_g2_has_g1g1": true,
          "update_g2_has_attn": true, "attn2_has_gate": true,
          "add_type_ebd_to_seq": false, "smooth": true,
          "_comment": " that's all"
        }
      ]
    },
    "fitting_net": { "neuron": [ 240, 240, 240 ], "resnet_dt": true, "seed": 1, "_comment": " that's all" },
    "_comment": " that's all"
  },
  "learning_rate": { "type": "exp", "decay_steps": 1, "start_lr": 0.0002, "stop_lr": 3.51e-08, "_comment": "that's all" },
  "loss": {
    "type": "ener", "start_pref_e": 0.02, "limit_pref_e": 1,
    "start_pref_f": 1000, "limit_pref_f": 1,
    "start_pref_v": 0, "limit_pref_v": 0, "_comment": " that's all"
  },
  "training": {
    "training_data": {
      "systems": [
        "/root/dpa2/data/H2O-scan0/data1/train", "/root/dpa2/data/H2O-scan0/data10/train", "/root/dpa2/data/H2O-scan0/data11/train",
        "/root/dpa2/data/H2O-scan0/data12/train", "/root/dpa2/data/H2O-scan0/data13/train", "/root/dpa2/data/H2O-scan0/data14/train",
        "/root/dpa2/data/H2O-scan0/data15/train", "/root/dpa2/data/H2O-scan0/data16/train", "/root/dpa2/data/H2O-scan0/data17/train",
        "/root/dpa2/data/H2O-scan0/data18/train", "/root/dpa2/data/H2O-scan0/data19/train", "/root/dpa2/data/H2O-scan0/data2/train",
        "/root/dpa2/data/H2O-scan0/data20/train", "/root/dpa2/data/H2O-scan0/data21/train", "/root/dpa2/data/H2O-scan0/data22/train",
        "/root/dpa2/data/H2O-scan0/data23/train", "/root/dpa2/data/H2O-scan0/data24/train", "/root/dpa2/data/H2O-scan0/data25/train",
        "/root/dpa2/data/H2O-scan0/data26/train", "/root/dpa2/data/H2O-scan0/data27/train", "/root/dpa2/data/H2O-scan0/data28/train",
        "/root/dpa2/data/H2O-scan0/data29/train", "/root/dpa2/data/H2O-scan0/data3/train", "/root/dpa2/data/H2O-scan0/data30/train",
        "/root/dpa2/data/H2O-scan0/data31/train", "/root/dpa2/data/H2O-scan0/data32/train", "/root/dpa2/data/H2O-scan0/data33/train",
        "/root/dpa2/data/H2O-scan0/data34/train", "/root/dpa2/data/H2O-scan0/data35/train", "/root/dpa2/data/H2O-scan0/data36/train",
        "/root/dpa2/data/H2O-scan0/data37/train", "/root/dpa2/data/H2O-scan0/data38/train", "/root/dpa2/data/H2O-scan0/data39/train",
        "/root/dpa2/data/H2O-scan0/data4/train", "/root/dpa2/data/H2O-scan0/data40/train", "/root/dpa2/data/H2O-scan0/data41/train",
        "/root/dpa2/data/H2O-scan0/data42/train", "/root/dpa2/data/H2O-scan0/data43/train", "/root/dpa2/data/H2O-scan0/data44/train",
        "/root/dpa2/data/H2O-scan0/data45/train", "/root/dpa2/data/H2O-scan0/data5/train", "/root/dpa2/data/H2O-scan0/data6/train",
        "/root/dpa2/data/H2O-scan0/data7/train", "/root/dpa2/data/H2O-scan0/data8/train", "/root/dpa2/data/H2O-scan0/data9/train",
        "/root/dpa2/data/H2O-scan0/data_ex1/train", "/root/dpa2/data/H2O-scan0/data_ex2/train", "/root/dpa2/data/H2O-scan0/data_ex3/train",
        "/root/dpa2/data/H2O-scan0/data_ex4/train", "/root/dpa2/data/H2O-scan0/data_ex5/train", "/root/dpa2/data/H2O-scan0/data_ex6/train"
      ],
      "batch_size": "auto", "_comment": "that's all"
    },
    "validation_data": {
      "systems": [
        "/root/dpa2/data/H2O-scan0/data1/valid", "/root/dpa2/data/H2O-scan0/data10/valid", "/root/dpa2/data/H2O-scan0/data11/valid",
        "/root/dpa2/data/H2O-scan0/data12/valid", "/root/dpa2/data/H2O-scan0/data13/valid", "/root/dpa2/data/H2O-scan0/data14/valid",
        "/root/dpa2/data/H2O-scan0/data15/valid", "/root/dpa2/data/H2O-scan0/data16/valid", "/root/dpa2/data/H2O-scan0/data17/valid",
        "/root/dpa2/data/H2O-scan0/data18/valid", "/root/dpa2/data/H2O-scan0/data19/valid", "/root/dpa2/data/H2O-scan0/data2/valid",
        "/root/dpa2/data/H2O-scan0/data20/valid", "/root/dpa2/data/H2O-scan0/data21/valid", "/root/dpa2/data/H2O-scan0/data22/valid",
        "/root/dpa2/data/H2O-scan0/data23/valid", "/root/dpa2/data/H2O-scan0/data24/valid", "/root/dpa2/data/H2O-scan0/data25/valid",
        "/root/dpa2/data/H2O-scan0/data26/valid", "/root/dpa2/data/H2O-scan0/data27/valid", "/root/dpa2/data/H2O-scan0/data28/valid",
        "/root/dpa2/data/H2O-scan0/data29/valid", "/root/dpa2/data/H2O-scan0/data3/valid", "/root/dpa2/data/H2O-scan0/data30/valid",
        "/root/dpa2/data/H2O-scan0/data31/valid", "/root/dpa2/data/H2O-scan0/data32/valid", "/root/dpa2/data/H2O-scan0/data33/valid",
        "/root/dpa2/data/H2O-scan0/data34/valid", "/root/dpa2/data/H2O-scan0/data35/valid", "/root/dpa2/data/H2O-scan0/data36/valid",
        "/root/dpa2/data/H2O-scan0/data37/valid", "/root/dpa2/data/H2O-scan0/data38/valid", "/root/dpa2/data/H2O-scan0/data39/valid",
        "/root/dpa2/data/H2O-scan0/data4/valid", "/root/dpa2/data/H2O-scan0/data40/valid", "/root/dpa2/data/H2O-scan0/data41/valid",
        "/root/dpa2/data/H2O-scan0/data42/valid", "/root/dpa2/data/H2O-scan0/data43/valid", "/root/dpa2/data/H2O-scan0/data44/valid",
        "/root/dpa2/data/H2O-scan0/data45/valid", "/root/dpa2/data/H2O-scan0/data5/valid", "/root/dpa2/data/H2O-scan0/data6/valid",
        "/root/dpa2/data/H2O-scan0/data7/valid", "/root/dpa2/data/H2O-scan0/data8/valid", "/root/dpa2/data/H2O-scan0/data9/valid",
        "/root/dpa2/data/H2O-scan0/data_ex1/valid", "/root/dpa2/data/H2O-scan0/data_ex2/valid", "/root/dpa2/data/H2O-scan0/data_ex3/valid",
        "/root/dpa2/data/H2O-scan0/data_ex4/valid", "/root/dpa2/data/H2O-scan0/data_ex5/valid", "/root/dpa2/data/H2O-scan0/data_ex6/valid"
      ],
      "batch_size": 1, "_comment": "that's all"
    },
    "numb_steps": 200, "warmup_steps": 0, "gradient_max_norm": 5.0,
    "seed": 10, "disp_file": "lcurve.out", "disp_freq": 25, "save_freq": 200,
    "_comment": "that's all",
    "wandb_config": { "wandb_enabled": false, "entity": "dp_model_engineering", "project": "DPA" }
  }
}
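With a systems list this long, a mistyped path is easy to miss until training fails. Here is a small sanity-check sketch using only the standard library (`list_systems` is a helper invented for this example, not part of the tutorial files):

```python
# Sketch: load input.json and list the training systems, so a typo'd or
# missing path can be spotted before launching a training run.
import json
import os

def list_systems(config_path: str) -> list[str]:
    """Return the training systems declared in a DeePMD-style input.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg["training"]["training_data"]["systems"]

if __name__ == "__main__":
    systems = list_systems("input.json")
    print(f"{len(systems)} training systems declared")
    for path in systems:
        if not os.path.isdir(path):
            print(f"warning: missing directory {path}")
```

Run it in `from_scratch/` or `finetune/`; for this tutorial it should report 51 systems.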
Compared with DPA-1, the parameters that differ in DPA-2 are concentrated in the descriptor section:
"descriptor": {
"type": "hybrid",
"hybrid_mode": "sequential",
"list": [
{
"type": "se_atten",
"sel": 120,
"rcut_smth": 8.0,
"rcut": 9.0,
"neuron": [
25,
50,
100
],
"resnet_dt": false,
"axis_neuron": 12,
"seed": 1,
"attn": 128,
"attn_layer": 0,
"attn_dotr": true,
"attn_mask": false,
"post_ln": true,
"ffn": false,
"ffn_embed_dim": 1024,
"activation": "tanh",
"scaling_factor": 1.0,
"head_num": 1,
"normalize": true,
"temperature": 1.0,
"add": "concat",
"pre_add": true,
"_comment": " that's all"
},
{
"type": "se_uni",
"sel": 40,
"rcut_smth": 3.5,
"rcut": 4.0,
"nlayers": 12,
"g1_dim": 128,
"g2_dim": 32,
"attn2_hidden": 32,
"attn2_nhead": 4,
"attn1_hidden": 128,
"attn1_nhead": 4,
"axis_dim": 4,
"update_h2": false,
"update_g1_has_conv": true,
"update_g1_has_grrg": true,
"update_g1_has_drrd": true,
"update_g1_has_attn": true,
"update_g2_has_g1g1": true,
"update_g2_has_attn": true,
"attn2_has_gate": true,
"add_type_ebd_to_seq": false,
"smooth": true,
"_comment": " that's all"
}
]
},
Compared with the widely used se_e2_a descriptor and DPA-1's se_atten descriptor, DPA-2 adopts a hybrid of the se_atten and se_uni descriptors. The two are connected sequentially: the output of the se_atten descriptor serves as the input of the se_uni descriptor. See Figure 2 for the detailed model construction.
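To make the "sequential" hybrid idea concrete, here is a purely illustrative sketch in which one stage's output feeds the next. The stage functions below are stand-ins invented for this example; they share nothing with the real se_atten/se_uni implementations except the chaining pattern:

```python
# Sketch of hybrid_mode = "sequential": stage 2 consumes stage 1's output.
# Both stages are toy stand-ins, not the actual DPA-2 descriptor blocks.
import numpy as np

def se_atten_stage(coords: np.ndarray) -> np.ndarray:
    """Stand-in for se_atten: map (n_atoms, 3) coordinates to per-atom features."""
    return np.tanh(coords @ np.ones((coords.shape[1], 8)))

def se_uni_stage(features: np.ndarray) -> np.ndarray:
    """Stand-in for se_uni: refine the features produced by the previous stage."""
    return features / (1.0 + np.linalg.norm(features, axis=1, keepdims=True))

def sequential_hybrid(coords: np.ndarray) -> np.ndarray:
    # The defining property of "sequential": composition, not concatenation.
    return se_uni_stage(se_atten_stage(coords))
```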
3.3 Model Training (from scratch)
/root/dpa2/from_scratch
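The training cell itself is not preserved in this export. Assuming the DeePMD-kit command-line entry point available in the `dpa2-tutorial:v3` image, the from-scratch run would look roughly like the following (a sketch, not the verbatim notebook command):

```shell
cd /root/dpa2/from_scratch
dp train input.json
```

The log below is the output of this training run.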
2023-12-01 11:12:25.880822: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2023-12-01 11:12:25.880908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2023-12-01 11:12:25.984780: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-12-01 11:12:26.194761: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-12-01 11:12:27.714354: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT WARNING:tensorflow:From /opt/mamba/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information. 
2023-12-01 11:12:30,608 [main.py:170] INFO DeepMD version: 0.1.3.dev254+g24d5796 2023-12-01 11:12:30,610 [main.py:133] INFO Configuration path: input.json 2023-12-01 11:12:30,843 [stat.py:37] INFO Packing data for statistics from 51 systems 100%|███████████████████████████████████████████| 51/51 [00:10<00:00, 4.83it/s] 2023-12-01 11:12:41,411 [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05427021 0.01356755 0.01356755 0.01356755 0.0271351 0.0271351 0.01356755 0.04027421 0.03484719 0.01242502 0.01356755 0.0271351 0.02442159 0.02142245 0.02113682 0.01756641 0.013996 0.00671237 0.00585547 0.0087118 0.0061411 0.00542702 0.01856612 0.020994 0.01628106 0.01685233 0.01799486 0.01628106 0.01085404 0.01985147 0.0165667 0.0157098 0.01899457 0.01356755 0.01628106 0.01628106 0.0524136 0.03227649 0.04070266 0.02913453 0.01342474 0.01328192 0.01356755 0.01356755 0.01356755 0.05384176 0.01356755 0.01356755 0.01356755 0.01342474 0.01256784] 2023-12-01 11:12:41,412 [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05763689 0.01440922 0.01440922 0.01440922 0.02881844 0.02881844 0.01440922 0.04034582 0.03458213 0.01152738 0.01440922 0.02881844 0.0259366 0.02017291 0.02017291 0.01729107 0.01440922 0.00576369 0.00576369 0.00864553 0.00576369 0.00288184 0.01729107 0.02017291 0.01729107 0.01729107 0.01729107 0.01440922 0.01152738 0.02017291 0.01729107 0.01440922 0.01729107 0.01440922 0.01440922 0.01729107 0.05475504 0.03170029 0.04034582 0.02881844 0.01152738 0.01152738 0.01440922 0.01440922 0.01440922 0.05475504 0.01152738 0.01440922 0.01440922 0.01152738 0.01152738] 2023-12-01 11:12:47,991 [model.py:53] INFO Saving stat file to stat_files/stat_file_rcut9.00_smth8.00_sel120_se_atten.npz 2023-12-01 11:12:47,994 [model.py:53] INFO Saving stat file to stat_files/stat_file_rcut4.00_smth3.50_sel40_se_uni.npz 2023-12-01 11:12:48,019 [ener.py:45] INFO Set seed to 1 in fitting net. 
2023-12-01 11:12:48,055 [training.py:359] INFO Start to train 200 steps. 2023-12-01 11:12:51,775 [training.py:510] INFO step=0, lr=2.00e-04 loss=4719.2101, rmse_train=68.6965, rmse_e_train=0.9473, rmse_f_train=2.1716, rmse_valid=59.3087, rmse_e_valid=0.7322, rmse_f_valid=1.8750, speed=3.72 s/1 batches 2023-12-01 11:12:59,114 [training.py:510] INFO step=25, lr=6.79e-05 loss=1148.5740, rmse_train=33.8906, rmse_e_train=0.2380, rmse_f_train=1.8324, rmse_valid=49.8568, rmse_e_valid=0.2527, rmse_f_valid=2.6997, speed=7.34 s/25 batches 2023-12-01 11:13:06,258 [training.py:510] INFO step=50, lr=2.30e-05 loss=149.6035, rmse_train=12.2312, rmse_e_train=0.0317, rmse_f_train=1.1351, rmse_valid=13.1363, rmse_e_valid=0.0566, rmse_f_valid=1.2178, speed=7.14 s/25 batches 2023-12-01 11:13:13,421 [training.py:510] INFO step=75, lr=7.81e-06 loss=25.6861, rmse_train=5.0681, rmse_e_train=0.0285, rmse_f_train=0.7989, rmse_valid=5.4680, rmse_e_valid=0.0385, rmse_f_valid=0.8605, speed=7.16 s/25 batches 2023-12-01 11:13:20,500 [training.py:510] INFO step=100, lr=2.65e-06 loss=10.3940, rmse_train=3.2240, rmse_e_train=0.0164, rmse_f_train=0.8524, rmse_valid=3.9693, rmse_e_valid=0.0351, rmse_f_valid=1.0442, speed=7.08 s/25 batches 2023-12-01 11:13:28,028 [training.py:510] INFO step=125, lr=8.99e-07 loss=4.2112, rmse_train=2.0521, rmse_e_train=0.0061, rmse_f_train=0.8751, rmse_valid=2.0618, rmse_e_valid=0.0061, rmse_f_valid=0.8792, speed=7.53 s/25 batches 2023-12-01 11:13:35,140 [training.py:510] INFO step=150, lr=3.05e-07 loss=1.7295, rmse_train=1.3151, rmse_e_train=0.0168, rmse_f_train=0.8149, rmse_valid=1.5815, rmse_e_valid=0.0261, rmse_f_valid=0.9694, speed=7.11 s/25 batches 2023-12-01 11:13:42,314 [training.py:510] INFO step=175, lr=1.03e-07 loss=1.3723, rmse_train=1.1715, rmse_e_train=0.0281, rmse_f_train=0.8972, rmse_valid=1.2020, rmse_e_valid=0.0001, rmse_f_valid=0.9760, speed=7.17 s/25 batches 2023-12-01 11:13:49,239 [training.py:529] INFO Saved model to model_200.pt 
100%|█████████████████████████████████████████| 200/200 [01:01<00:00, 3.27it/s] 2023-12-01 11:13:49,245 [training.py:562] INFO Trained model has been saved to: model.pt
3.4 Model Fine-Tuning
/root/dpa2/finetune
The fine-tuning command adds the --finetune ../pretrain_model/model.pt option, meaning that training starts from the model ../pretrain_model/model.pt.
It also adds the -m H2O_H2O-PD option: because our H2O-SCAN0 dataset is close to the pretraining dataset H2O_H2O-PD, we use the fitting-net parameters of the H2O_H2O-PD branch obtained during pretraining to initialize the fitting net of the fine-tuned model.
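Putting the two options together, the fine-tuning invocation would look roughly like this (the `dp train input.json` base command is an assumption about the image's CLI; the `--finetune` and `-m` options are as described above):

```shell
cd /root/dpa2/finetune
dp train input.json --finetune ../pretrain_model/model.pt -m H2O_H2O-PD
```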
2023-12-01 11:14:28.921978: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2023-12-01 11:14:28.922049: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2023-12-01 11:14:28.923370: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-12-01 11:14:28.931242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-12-01 11:14:29.940467: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT WARNING:tensorflow:From /opt/mamba/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information. 2023-12-01 11:14:31,822 [main.py:170] INFO DeepMD version: 0.1.3.dev254+g24d5796 2023-12-01 11:14:31,824 [main.py:133] INFO Configuration path: input.json 2023-12-01 11:14:36,380 [finetune.py:55] INFO Change the model configurations according to the model branch H2O_H2O-PD in the pretrained one... 
2023-12-01 11:14:36,580 [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05427021 0.01356755 0.01356755 0.01356755 0.0271351 0.0271351 0.01356755 0.04027421 0.03484719 0.01242502 0.01356755 0.0271351 0.02442159 0.02142245 0.02113682 0.01756641 0.013996 0.00671237 0.00585547 0.0087118 0.0061411 0.00542702 0.01856612 0.020994 0.01628106 0.01685233 0.01799486 0.01628106 0.01085404 0.01985147 0.0165667 0.0157098 0.01899457 0.01356755 0.01628106 0.01628106 0.0524136 0.03227649 0.04070266 0.02913453 0.01342474 0.01328192 0.01356755 0.01356755 0.01356755 0.05384176 0.01356755 0.01356755 0.01356755 0.01342474 0.01256784] 2023-12-01 11:14:36,580 [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05763689 0.01440922 0.01440922 0.01440922 0.02881844 0.02881844 0.01440922 0.04034582 0.03458213 0.01152738 0.01440922 0.02881844 0.0259366 0.02017291 0.02017291 0.01729107 0.01440922 0.00576369 0.00576369 0.00864553 0.00576369 0.00288184 0.01729107 0.02017291 0.01729107 0.01729107 0.01729107 0.01440922 0.01152738 0.02017291 0.01729107 0.01440922 0.01729107 0.01440922 0.01440922 0.01729107 0.05475504 0.03170029 0.04034582 0.02881844 0.01152738 0.01152738 0.01440922 0.01440922 0.01440922 0.05475504 0.01152738 0.01440922 0.01440922 0.01152738 0.01152738] 2023-12-01 11:14:37,717 [ener.py:45] INFO Set seed to 1 in fitting net. 2023-12-01 11:14:37,740 [training.py:238] INFO Resuming from ../pretrain_model/model.pt. 
2023-12-01 11:14:38,291 [fitting.py:101] INFO Changing energy bias in pretrained model for types ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og']... (this step may take long time) 2023-12-01 11:14:38,424 [stat.py:37] INFO Packing data for statistics from 51 systems 100%|███████████████████████████████████████████| 51/51 [00:02<00:00, 17.08it/s] 2023-12-01 11:14:48,994 [fitting.py:166] INFO RMSE of atomic energy after linear regression is: 5.49232e-04 eV/atom. 2023-12-01 11:14:48,996 [fitting.py:178] INFO Change energy bias of ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og'] from [-6.338534 0. 0. 0. 0. 0. 0. -3.169267 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ] to [-187.04663 0. 0. 0. 0. 0. 0. -93.523315 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]. 2023-12-01 11:14:49,010 [training.py:359] INFO Start to train 200 steps. 2023-12-01 11:14:49,735 [training.py:510] INFO step=0, lr=2.00e-04 loss=4.6365, rmse_train=2.1533, rmse_e_train=0.0002, rmse_f_train=0.0681, rmse_valid=34.4455, rmse_e_valid=0.2252, rmse_f_valid=1.0892, speed=0.73 s/1 batches 2023-12-01 11:14:57,273 [training.py:510] INFO step=25, lr=6.79e-05 loss=12.2519, rmse_train=3.5003, rmse_e_train=0.0112, rmse_f_train=0.1897, rmse_valid=2.8736, rmse_e_valid=0.0033, rmse_f_valid=0.1558, speed=7.54 s/25 batches 2023-12-01 11:15:04,581 [training.py:510] INFO step=50, lr=2.30e-05 loss=1.9723, rmse_train=1.4044, rmse_e_train=0.0195, rmse_f_train=0.1282, rmse_valid=1.1252, rmse_e_valid=0.0116, rmse_f_valid=0.1035, speed=7.31 s/25 batches 2023-12-01 11:15:11,969 [training.py:510] INFO step=75, lr=7.81e-06 loss=0.3784, rmse_train=0.6151, rmse_e_train=0.0019, rmse_f_train=0.0972, rmse_valid=0.5964, rmse_e_valid=0.0039, rmse_f_valid=0.0939, speed=7.39 s/25 batches 2023-12-01 11:15:19,280 [training.py:510] INFO step=100, lr=2.65e-06 loss=0.1231, rmse_train=0.3508, rmse_e_train=0.0062, rmse_f_train=0.0902, rmse_valid=0.3211, rmse_e_valid=0.0017, rmse_f_valid=0.0849, speed=7.31 s/25 batches 2023-12-01 11:15:26,657 [training.py:510] INFO step=125, lr=8.99e-07 loss=0.0388, rmse_train=0.1970, rmse_e_train=0.0023, rmse_f_train=0.0830, rmse_valid=0.2276, 
rmse_e_valid=0.0014, rmse_f_valid=0.0968, speed=7.38 s/25 batches 2023-12-01 11:15:33,982 [training.py:510] INFO step=150, lr=3.05e-07 loss=0.0208, rmse_train=0.1442, rmse_e_train=0.0002, rmse_f_train=0.0907, rmse_valid=0.1755, rmse_e_valid=0.0020, rmse_f_valid=0.1092, speed=7.32 s/25 batches 2023-12-01 11:15:41,362 [training.py:510] INFO step=175, lr=1.03e-07 loss=0.0130, rmse_train=0.1139, rmse_e_train=0.0032, rmse_f_train=0.0853, rmse_valid=0.1112, rmse_e_valid=0.0014, rmse_f_valid=0.0889, speed=7.38 s/25 batches 2023-12-01 11:15:48,539 [training.py:529] INFO Saved model to model_200.pt 100%|█████████████████████████████████████████| 200/200 [00:59<00:00, 3.36it/s] 2023-12-01 11:15:48,545 [training.py:562] INFO Trained model has been saved to: model.pt
3.5 Checking the Models
We have now finished both training a DPA-2 potential from scratch and fine-tuning one. Let's inspect the models' behavior through lcurve.out (the learning-curve output file, which records the loss and the energy/force RMSE during training)!
📌 Note: to save time, this tutorial trains for only 200 steps, which is not enough to reach convergence; in real training scenarios, remember to use an appropriately longer schedule~
/root/dpa2
Want a direct comparison of the lcurve from training from scratch against fine-tuning from the pretrained model? Run the prepared visualization script with one click~
Comparing the lcurves, we can see that fine-tuning from the pretrained model reaches lower energy and force errors.
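The prepared visualization script is not included in this export. Below is a minimal sketch of such a comparison, assuming lcurve.out begins with a '#'-prefixed header naming whitespace-separated numeric columns; the specific column names used for plotting (e.g. `rmse_f_val`) are assumptions and should be adjusted to your file's header:

```python
# Sketch: parse lcurve.out and compare two runs on one plot.
# Assumed layout: first line "# col1 col2 ...", then numeric rows.
import numpy as np

def parse_lcurve(path: str) -> dict[str, np.ndarray]:
    """Return {column_name: values} for a DeePMD-style lcurve.out file."""
    with open(path) as f:
        header = f.readline().lstrip("#").split()
    data = np.loadtxt(path, ndmin=2)  # '#' header line is skipped as a comment
    return {name: data[:, i] for i, name in enumerate(header)}

def plot_comparison(scratch_path: str, finetune_path: str) -> None:
    import matplotlib.pyplot as plt  # imported lazily; plotting is optional
    for label, path in [("from scratch", scratch_path), ("finetune", finetune_path)]:
        curve = parse_lcurve(path)
        plt.semilogy(curve["step"], curve["rmse_f_val"], label=label)
    plt.xlabel("step")
    plt.ylabel("force RMSE")
    plt.legend()
    plt.show()
```

Calling `plot_comparison("from_scratch/lcurve.out", "finetune/lcurve.out")` from `/root/dpa2` reproduces the comparison described above.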
4. References
DPA-2:
- DPA-2 paper in preparation
DPA-1:
- DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation
- DeePMD-kit’s documentation
- Quick Start with the DPA-1 Pretrained Deep Potential Model
- DPA-1: Building a Pretrained Large Model Covering the Periodic Table
- Duo Zhang: Introduction to the DPA-1 Pretrained Model & Hands-On Practice
Recommended DPMD-series notebooks:
- Columbus Bootcamp | DPA-1 for Solid-State Electrolytes: Model Training & Property Calculation
- Quick Start with DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane
- From DFT to MD | A Detailed Hands-On Guide to Deep Potential Materials Computation
- Hands-On Solid-State Electrolyte Research with Deep Potential Molecular Dynamics
Bibliography
- Han Wang, Linfeng Zhang, Jiequn Han, and Weinan E. DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Comput. Phys. Comm., 228:178–184, 2018. doi:10.1016/j.cpc.2018.03.016.
- Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, and Han Wang. DeePMD-kit v2: A software package for Deep Potential models. 2023. doi:10.48550/arXiv.2304.09409.
- J. Huang, L. Zhang, H. Wang, J. Zhao, J. Cheng, and W. E. Deep potential generation scheme and simulation protocol for the Li10GeP2S12-type superionic conductors. J. Chem. Phys., 154(9):094703, 2021. doi:10.1063/5.0041849.
- https://docs.deepmodeling.com/projects/deepmd/en/master/index.html
- https://github.com/deepmodeling/deepmd-kit
Yuliang Guo