Getting to Know DPA-1 | Solid-State Electrolytes in Practice: Model Training
©️ Copyright 2023 @ Authors
Author: 宋哲轩 📨
Date: 2023-07-20
License: this work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the blue Connect button at the top of the page, select the `deepmd-kit:2.2.1-cuda11.6` image and a `c12_m92_1 * NVIDIA V100` node, and after a short wait you are ready to run.
*This tutorial assumes basic familiarity with DeePMD-kit; if you are new to it, read "Quick Start to DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane" first.
🎯 Welcome to "Getting to Know DPA-1 | Solid-State Electrolytes in Practice: Model Training"!
Drawing on the original DPA-1 paper (arXiv:2208.08236), this guide introduces the research background and basic principles of the DPA-1 model, with practical code examples to clarify the key parameters. Using a solid-state electrolyte as the example, it then walks you through training a DPA-1 potential with the training set from J. Chem. Phys. 154, 094703 (2021).
Come meet DPA-1 with us and open a new chapter in your exploration of potentials!
📐 One day, your advisor asks you to study solid-state electrolytes with molecular dynamics simulations:
Student A: rule out classical molecular dynamics (CMD), whose accuracy is not sufficient 🙅♀️
Student B: rule out ab initio molecular dynamics (AIMD), which is very accurate but cannot handle large systems or long time scales 🙅♀️
You: the natural choice is machine-learning molecular dynamics (MLMD), today's hottest approach, combining efficiency and accuracy 🙌
👍 Advisor: Agreed, go for it!
After a night of reading up on machine-learning potentials, you start wondering: machine-learning potentials are practical, but is there a model that can be used out of the box? And if not, could I start from a publicly available model and, by adjusting it slightly on my existing dataset, obtain a reliable model?
You find that existing potential models and methods have shortcomings:
- Some general-purpose models serve only narrow scenarios and cover a small chemical space
- Tools such as dpgen can build conformationally rich datasets for retraining a model, but at a high cost
By now you are convinced: a large-scale pre-trained model/potential would let us obtain a potential suited to our application while saving both time and money 🚀.
📣 Recently, Duo Zhang (张铎), Hangrui Bi (毕航睿), and collaborators at DP Technology and the AI for Science Institute, Beijing posted a preprint on arXiv titled "DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation".
By encoding element types more effectively and exploiting a key attention mechanism, DPA-1 greatly improves the capacity and transferability of earlier Deep Potential models, yielding a large pre-trained model that covers most common elements of the periodic table. Transfer-learning results on various datasets show that the model can drastically reduce the amount of data required in new scenarios. More details can be found in the WeChat post and in the original paper. Both training and molecular dynamics with DPA-1 are open-sourced in the DeePMD-kit project of the DeepModeling community. The work was carried out on Bohrium, DP Technology's scientific computing cloud platform.
You now know that DPA-1 is an attention-based DP model that describes interatomic interactions effectively and, once pre-trained, can significantly reduce the extra work needed for downstream tasks.
All you need now is a guide for quickly training a DPA-1 potential from scratch (the dpa example) and for fine-tuning an existing large model on your own dataset (the dpa_finetune example).
1. Learning Objectives
After this tutorial, you will take away:
- The basic principles and applications of DPA-1;
- Hands-on DPA-1 potential training on a solid-state electrolyte: interpreting the input script; training from scratch vs. fine-tuning an existing pre-trained model; evaluating and testing the models
2. A Brief Introduction to DPA-1
👂 Eager to get hands-on? Jump straight to Section 3~
2.1 Background
Potential training has always sought a balance between accuracy and efficiency. Classical force fields are convenient and fast, but their accuracy is hard to push further; AIMD (ab initio molecular dynamics), popular in recent years, delivers much higher accuracy, but its computational cost rules out large systems and long time scales. With the development of AI for Science, machine learning has made it possible to train potentials that are both accurate and efficient (Fig. 1: comparison of molecular dynamics simulation approaches). In the MLMD paradigm, quantum-mechanical (QM) calculations are no longer used to drive AIMD directly; instead they prepare the dataset for the machine-learning potential (MLP), and of course AIMD results can also serve as the initial dataset.
However, existing models transfer poorly and no general-purpose large model exists, so when facing a new, complex system, scientists still largely have to collect large amounts of computed data and train a model from scratch before obtaining a usable, reasonably complete potential. As electronic-structure data accumulate, and by analogy with other AI fields such as computer vision (CV) and natural language processing (NLP), **"pre-training + fine-tuning on a small amount of data"** is the natural way to tackle this problem.
To realize this paradigm, we need a model architecture with strong transferability that can accommodate most elements of the periodic table.
2.2 Methods
The DPA-1 model is a comprehensive upgrade of the DP model family. Its key gated attention mechanism models interatomic interactions much more thoroughly, so that training on existing data captures more of the latent atomic interactions. This greatly improves transferability across datasets with different conformations and compositions and, in turn, the sampling efficiency of data generation; encoding element information also expands the model's element capacity. The developers pre-trained the model on a fairly large dataset covering 56 elements and transferred it to a variety of downstream tasks. Experiments show that the pre-trained model drastically cuts the data and training cost of downstream tasks while improving prediction accuracy, with far-reaching implications for molecular simulation (Fig. 2: schematic of the DPA-1 model architecture).
Compared with previous DP models, DPA-1 adjusts the methodology as follows:
- Descriptor: a [type embedding] adds the atom type as an input to the embedding matrix, and an [attention mechanism] reweights the interatomic interactions according to the atoms' distances and angles (see the sketch after this list)
- Loss computation for fine-tuning: to fine-tune the pre-trained model on a new dataset, the energy bias of the pre-trained model is first shifted using statistics of the new data, then part of the pre-trained parameters are frozen and the remaining ones are trained
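To make the attention step concrete, below is a minimal NumPy sketch of dot-product attention gated by neighbor geometry, in the spirit of the attn_dotr option discussed in Section 3.2. It is an illustrative reading of the mechanism, not the DeePMD-kit implementation, and every name in it is hypothetical.

import numpy as np

def gated_attention(q, k, v, r_hat):
    # q, k, v: (N, d) query/key/value rows, one per neighbor atom
    # r_hat:   (N, 3) unit vectors from the central atom to each neighbor
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                  # scaled dot-product scores
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)               # softmax over neighbors
    gate = r_hat @ r_hat.T                         # cos(angle) between neighbor directions
    return (w * gate) @ v                          # geometry-gated mixing of neighbor features

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
r = rng.normal(size=(5, 3))
r_hat = r / np.linalg.norm(r, axis=1, keepdims=True)
print(gated_attention(q, k, v, r_hat).shape)       # (5, 8)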
For inference, DPA-1 retains the high efficiency of the DP model family and can run molecular dynamics simulations of systems with many atoms and many elements.
2.3 Experimental Validation
Transferability tests were run on:
- a ternary alloy (AlMgCu) dataset
- a solid-state electrolyte (SSE) dataset
- a high-entropy alloy (HEA) dataset
Note: the OC20 dataset consists of single adsorbates (small molecules) physically bound to catalyst surfaces, where the surfaces are periodic bulk materials spanning 56 elements.
To test the transferability gains from the DPA-1 architecture, the researchers deliberately split each training set into subsets that differ strongly in composition and configuration (for AlMgCu, the single subset contains only elemental data; binary contains only binary data, i.e. Al-Mg, Al-Cu, and Mg-Cu; ternary holds the remaining ternary data). They trained on some subsets and tested on the others, probing the model's transferability under extreme conditions (Fig. 3: learning curves of energies and forces for DeepPot-SE and DPA-1 across different settings and systems).
Compared with DeepPot-SE, DPA-1's test accuracy improves by as much as one to two orders of magnitude under some conditions. This shows that the model can learn latent interatomic interactions from existing data and further demonstrates its strong transferability.
Sample-efficiency test (Fig. 4: sample-efficiency results): even with only a small amount of ternary data, DPA-1 reaches high accuracy, saving roughly 90% of the ternary data compared with DeepPot-SE.
2.4 Model Interpretability
To probe the interpretability of this pre-trained model covering most of the periodic table, the researchers reduced the learned type embeddings with PCA and visualized them, as shown in Fig. 5:
All elements fall on a spiral in the latent space; elements of the same period descend along the spiral while elements of the same group line up perpendicular to it, neatly mirroring their positions in the periodic table and nicely demonstrating the model's interpretability.
2.5 Outlook
DPA-1 opens a new paradigm for producing machine-learning potentials and demonstrates the feasibility of the "pre-train + light task-specific fine-tuning" workflow. The researchers will keep working toward automated production and testing of potentials, and on topics such as multi-task training, unsupervised learning, model compression, and distillation, so that users can generate the potential a downstream task needs with one click. Larger and more complete databases, and coupling downstream tasks to the dflow workflow framework, are also promising future directions.
3. Solid-State Electrolytes in Practice: Training a DPA-1 Potential
With the theory covered, let's get hands-on! In this section we use the solid-state electrolyte dataset LiGePS-SSE-PBE to train DPA-1 both from scratch and by fine-tuning.
Note: the dataset used in this tutorial comes from AIS-Square; if you need more models and data, go explore it!
3.1 Downloading the Dataset
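The dataset and input files are obtained by cloning the tutorial repository into study_examples (the clone URL is not shown in this text export); if the data were already downloaded, git reports the message below. The listing that follows shows the resulting directory layout: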
fatal: destination path 'study_examples' already exists and is not an empty directory.
.
├── DeePMD-SSE.ipynb
├── dpa
│   ├── checkpoint
│   ├── dpa.pb
│   ├── input.json
│   ├── iter.000000
│   │   └── 02.fp
│   ├── iter.000001
│   │   └── 02.fp
│   └── iter.000002
│       └── 02.fp
├── dpa-sse.ipynb
└── dpa_finetune
    ├── input.json
    ├── iter.000000
    │   └── 02.fp
    ├── iter.000001
    │   └── 02.fp
    └── iter.000002
        └── 02.fp

14 directories, 6 files
3.2 Preparing the Input Script (training from scratch)
{ "model": { "descriptor": { "type": "se_atten", "sel": 60, "rcut_smth": 0.5, "rcut": 6.0, "neuron": [ 25, 50, 100 ], "resnet_dt": false, "axis_neuron": 16, "attn": 128, "attn_layer": 1, "attn_dotr": true, "attn_mask": false, "seed": 1801819940, "_activation_function": "tanh" }, "fitting_net": { "neuron": [ 240, 240, 240 ], "resnet_dt": true, "_coord_norm": true, "_type_fitting_net": false, "seed": 2375417769, "_activation_function": "tanh" }, "type_map": [ "Li", "Ge", "P", "S" ] }, "learning_rate": { "type": "exp", "start_lr": 0.001, "decay_steps": 50, "stop_lr": 3.51e-08 }, "loss": { "start_pref_e": 0.02, "limit_pref_e": 1, "start_pref_f": 1000, "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0 }, "training": { "training_data": { "systems": [ "iter.000000/02.fp/data.000", "iter.000000/02.fp/data.001", "iter.000000/02.fp/data.002", "iter.000000/02.fp/data.003" ], "batch_size": 1 }, "validation_data": { "systems": [ "iter.000001/02.fp/data.000", "iter.000001/02.fp/data.001", "iter.000001/02.fp/data.002", "iter.000001/02.fp/data.003", "iter.000002/02.fp/data.000", "iter.000002/02.fp/data.001", "iter.000002/02.fp/data.002", "iter.000002/02.fp/data.003" ], "batch_size": 1 }, "numb_steps": 10000, "seed": 3982377700, "_comment": "that's all", "disp_file": "lcurve.out", "disp_freq": 100, "numb_test": 1, "save_freq": 2000, "save_ckpt": "model.ckpt", "disp_training": true, "time_training": true, "profiling": false, "profiling_file": "timeline.json" } }
Compared with the dp_se_e2 model, DPA-1 uses se_atten as its descriptor, and the main parameter changes are concentrated in the descriptor section:
"descriptor": {
"type": "se_atten",
"rcut_smth": 0.5,
"rcut": 6.0,
"sel": 60,
"neuron": [25,50,100],
"axis_neuron": 16,
"resnet_dt": false,
"attn": 128,
"attn_layer": 2,
"attn_mask": false,
"attn_dotr": true,
"seed": 1801819940,
"_activation_function": "tanh"
},
Compared with the familiar se_e2_a descriptor, the following parameters differ:
type:"se_atten"
:表示采用DPA-1描述子结构;rcut
:邻近列表的截断半径,rcut_smth
:平滑起点;sel
:纳入考虑的最大邻居原子个数总和,这个值和DPA-1的训练效率高度相关,一般我们不用设置太大,推荐最大不超过200。DPA-1论文中训练含有56种元素的OC2M数据集,也只用了120就已经足够了;与dp_se_e2中输入的list型不同,这里sel
为int类型;neuron
:指定嵌入网络的大小axis_neuron
:嵌入矩阵的子矩阵大小,即DeepPot-SE论文中的the axis matrixresnet_dt
:若选项设置为true,则在ResNet中使用时间步长seed
:初始化模型参数时用于生成随机数的随机种子
Besides these, several attention-related parameters are new:
attn"
:attention过程中的隐向量长度;attn_layer
:代表总共进行几层attention过程,一般我们推荐2层就可以;attn_mask
:代表是否将attention权重的对角线mask掉;attn_dotr
:代表是否对attention权重点乘相对坐标的乘积,类似一个门控注意力机制(Gated Attention Mechanism) 其他参数和大家常用的“se_e2_a”描述子中代表的含义保持一致,大家可以参考这里来获得更详细的解释。
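A practical way to choose sel is to inspect the neighbor statistics of your data first, e.g. with DeePMD-kit's `dp neighbor-stat -s <data dir> -r 6.0 -t Li Ge P S` (the data directory here is a placeholder). In our case the training log in Section 3.3 reports a maximum neighbor count of 56, comfortably below our sel of 60.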
Other notes:
For the rest of the model, note first that DPA-1 supports only the "ener" type of fitting network; you can refer to the standard fitting-network parameter settings.
Second, DPA-1 enables type embedding by default to encode element information and enlarge the model's element capacity. The default parameters are:
"type_embedding":{
"neuron": [2, 4, 8],
"resnet_dt": false,
"seed": 1
},
These parameters have the same meanings as the standard type-embedding parameters; to change the defaults, add the block above under type_embedding and customize it.
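For instance, a short Python snippet (a sketch; it assumes input.json sits in the working directory) that injects a custom type_embedding block:

import json

with open("input.json") as f:
    cfg = json.load(f)

# override the defaults quoted above
cfg["model"]["type_embedding"] = {"neuron": [2, 4, 8], "resnet_dt": False, "seed": 1}

with open("input.json", "w") as f:
    json.dump(cfg, f, indent=4)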
DPA-1 is particularly well suited to systems containing many elements, especially more than ten. In that case, set the type_map parameter manually to map each type index to its element:
"type_map": [
"Li",
"Ge",
"P",
"S"
]
3.3 Training the Model (from scratch)
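The cell behind the log below presumably runs `dp train input.json` inside the dpa directory and then `dp freeze -o dpa.pb` to freeze the trained graph; the freezing step shows up at the end of the log.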
2023-07-22 16:55:05.785055: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-22 16:55:06.825439: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 16:55:06.825532: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 16:55:06.825552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module. /opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead. warnings.warn( DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step) DEEPMD INFO training data with min nbor dist: 1.7120899465949608 DEEPMD INFO training data with max nbor size: [56] DEEPMD INFO _____ _____ __ __ _____ _ _ _ DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| | DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_ DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __| DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_ DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__| DEEPMD INFO Please read and cite: DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 
228, 178-184 (2018) DEEPMD INFO installed to: /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install DEEPMD INFO source : v2.2.0.b0-77-gc1299196 DEEPMD INFO source brach: devel DEEPMD INFO source commit: c1299196 DEEPMD INFO source commit at: 2023-02-28 09:06:04 +0800 DEEPMD INFO build float prec: double DEEPMD INFO build variant: cpu DEEPMD INFO build with tf inc: /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include DEEPMD INFO build with tf lib: DEEPMD INFO ---Summary of the training--------------------------------------- DEEPMD INFO running on: bohrium-13387-1030178 DEEPMD INFO computing device: gpu:0 DEEPMD INFO CUDA_VISIBLE_DEVICES: unset DEEPMD INFO Count of visible GPU: 1 DEEPMD INFO num_intra_threads: 0 DEEPMD INFO num_inter_threads: 0 DEEPMD INFO ----------------------------------------------------------------- DEEPMD INFO ---Summary of DataSystem: training ----------------------------------------------- DEEPMD INFO found 4 system(s): DEEPMD INFO system natoms bch_sz n_bch prob pbc DEEPMD INFO iter.000000/02.fp/data.000 400 1 127 0.241 T DEEPMD INFO iter.000000/02.fp/data.001 400 1 131 0.248 T DEEPMD INFO iter.000000/02.fp/data.002 400 1 133 0.252 T DEEPMD INFO iter.000000/02.fp/data.003 400 1 137 0.259 T DEEPMD INFO -------------------------------------------------------------------------------------- DEEPMD INFO ---Summary of DataSystem: validation ----------------------------------------------- DEEPMD INFO found 8 system(s): DEEPMD INFO system natoms bch_sz n_bch prob pbc DEEPMD INFO iter.000001/02.fp/data.000 400 1 117 0.111 T DEEPMD INFO iter.000001/02.fp/data.001 400 1 137 0.129 T DEEPMD INFO iter.000001/02.fp/data.002 400 1 137 0.129 T DEEPMD INFO iter.000001/02.fp/data.003 400 1 138 0.130 T DEEPMD INFO iter.000002/02.fp/data.000 400 1 133 0.126 T DEEPMD INFO iter.000002/02.fp/data.001 400 1 133 0.126 T DEEPMD INFO iter.000002/02.fp/data.002 400 1 134 0.127 T DEEPMD INFO iter.000002/02.fp/data.003 400 1 129 0.122 T DEEPMD INFO -------------------------------------------------------------------------------------- DEEPMD INFO training without frame parameter DEEPMD INFO data stating... (this step may take long time) DEEPMD INFO built lr DEEPMD INFO built network DEEPMD INFO built training WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. 
DEEPMD INFO initialize model from scratch DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08 DEEPMD INFO batch 100 training time 8.37 s, testing time 0.05 s DEEPMD INFO batch 200 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 300 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 400 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 500 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 600 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 700 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 800 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 900 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1000 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 1100 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1200 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 1300 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1400 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1500 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1600 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1700 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 1800 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 1900 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 2000 training time 6.19 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 2100 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2200 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2300 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2400 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 2600 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 2700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 2800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 2900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3000 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3100 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 3200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3800 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 3900 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 4000 training time 6.17 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 4100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4500 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 4600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4800 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 4900 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 5000 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 5100 training time 6.16 s, testing time 0.05 s DEEPMD 
INFO batch 5200 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 5300 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 5400 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 5500 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 5600 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 5700 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 5800 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 5900 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 6000 training time 6.18 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 6100 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 6200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6400 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6500 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6900 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7000 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7200 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 7300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7500 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7700 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 7800 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 7900 training time 6.23 s, testing time 0.05 s DEEPMD INFO batch 8000 training time 6.24 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 8100 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 8200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8400 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 8500 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 8600 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 8700 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 8800 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 8900 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 9000 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 9100 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 9200 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9600 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9800 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 10000 training time 6.16 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO average training time: 0.0618 s/batch (exclude first 100 batches) DEEPMD INFO finished training DEEPMD INFO wall time: 635.323 s 2023-07-22 17:05:57.559565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is 
optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-22 17:05:58.573920: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:05:58.574019: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:05:58.574036: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. DEEPMD INFO The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam'] WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. 
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. DEEPMD INFO 1334 ops in the final graph.
3.4 Fine-Tuning Training
{ "model": { "type_embedding":{"trainable": true}, "descriptor": {"trainable": true}, "fitting_net": {"trainable": true}, "type_map": [ "Li", "Ge", "P", "S" ] }, "learning_rate": { "type": "exp", "start_lr": 0.001, "decay_steps": 50, "stop_lr": 3.51e-08 }, "loss": { "type": "ener", "start_pref_e": 0.02, "limit_pref_e": 1, "start_pref_f": 1000, "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0 }, "training": { "training_data": { "systems": [ "iter.000000/02.fp/data.000", "iter.000000/02.fp/data.001", "iter.000000/02.fp/data.002", "iter.000000/02.fp/data.003" ], "batch_size": 1 }, "validation_data": { "systems": [ "iter.000001/02.fp/data.000", "iter.000001/02.fp/data.001", "iter.000001/02.fp/data.002", "iter.000001/02.fp/data.003", "iter.000002/02.fp/data.000", "iter.000002/02.fp/data.001", "iter.000002/02.fp/data.002", "iter.000002/02.fp/data.003" ], "batch_size": 1 }, "numb_steps": 10000, "seed": 3982377700, "_comment": "that's all", "disp_file": "lcurve.out", "disp_freq": 100, "numb_test": 1, "save_freq": 2000, "save_ckpt": "model.ckpt", "disp_training": true, "time_training": true, "profiling": false, "profiling_file": "timeline.json" } }
In the fine-tuning input file, we only need to set the type_embedding, descriptor, and fitting_net blocks to {"trainable": true}; there is no need to write their parameters out again (the log below shows them being inherited from the pre-trained model):
"model": {
"type_embedding":{"trainable": true},
"descriptor": {"trainable": true},
"fitting_net": {"trainable": true},
In this tutorial we simply use the dpa.pb we just trained as the example pre-trained model for fine-tuning.
The fine-tuning command adds the --finetune dpa.pb option.
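In full, the command is presumably `dp train input.json --finetune dpa.pb`, run in the dpa_finetune directory with the pre-trained dpa.pb available there (exact paths are assumptions).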
2023-07-22 17:08:43.230841: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-22 17:08:44.271205: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:08:44.271303: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:08:44.271321: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module. DEEPMD INFO Change the model configurations according to the pretrained one... DEEPMD INFO Change the 'descriptor' from {'trainable': True} to {'type': 'se_atten', 'sel': 60, 'rcut_smth': 0.5, 'rcut': 6.0, 'neuron': [25, 50, 100], 'resnet_dt': False, 'axis_neuron': 16, 'attn': 128, 'attn_layer': 2, 'attn_dotr': True, 'attn_mask': False, 'seed': 1801819940, 'activation_function': 'tanh', 'type_one_side': False, 'precision': 'default', 'trainable': True, 'exclude_types': []}. DEEPMD INFO Change the 'fitting_net' from {'trainable': True} to {'neuron': [240, 240, 240], 'resnet_dt': True, 'seed': 2375417769, 'type': 'ener', 'numb_fparam': 0, 'numb_aparam': 0, 'activation_function': 'tanh', 'precision': 'default', 'trainable': True, 'rcond': 0.001, 'atom_ener': [], 'use_aparam_as_mask': False}. /opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead. warnings.warn( DEEPMD INFO Calculate neighbor statistics... 
(add --skip-neighbor-stat to skip this step) DEEPMD INFO training data with min nbor dist: 1.7120899465949608 DEEPMD INFO training data with max nbor size: [56] DEEPMD INFO _____ _____ __ __ _____ _ _ _ DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| | DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_ DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __| DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_ DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__| DEEPMD INFO Please read and cite: DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018) DEEPMD INFO installed to: /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install DEEPMD INFO source : v2.2.0.b0-77-gc1299196 DEEPMD INFO source brach: devel DEEPMD INFO source commit: c1299196 DEEPMD INFO source commit at: 2023-02-28 09:06:04 +0800 DEEPMD INFO build float prec: double DEEPMD INFO build variant: cpu DEEPMD INFO build with tf inc: /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include DEEPMD INFO build with tf lib: DEEPMD INFO ---Summary of the training--------------------------------------- DEEPMD INFO running on: bohrium-13387-1030178 DEEPMD INFO computing device: gpu:0 DEEPMD INFO CUDA_VISIBLE_DEVICES: unset DEEPMD INFO Count of visible GPU: 1 DEEPMD INFO num_intra_threads: 0 DEEPMD INFO num_inter_threads: 0 DEEPMD INFO ----------------------------------------------------------------- DEEPMD INFO ---Summary of DataSystem: training ----------------------------------------------- DEEPMD INFO found 4 system(s): DEEPMD INFO system natoms bch_sz n_bch prob pbc DEEPMD INFO iter.000000/02.fp/data.000 400 1 127 0.241 T DEEPMD INFO iter.000000/02.fp/data.001 400 1 131 0.248 T DEEPMD INFO iter.000000/02.fp/data.002 400 1 133 0.252 T DEEPMD INFO iter.000000/02.fp/data.003 400 1 137 0.259 T DEEPMD INFO -------------------------------------------------------------------------------------- DEEPMD INFO ---Summary of DataSystem: validation ----------------------------------------------- DEEPMD INFO found 8 system(s): DEEPMD INFO system natoms bch_sz n_bch prob pbc DEEPMD INFO iter.000001/02.fp/data.000 400 1 117 0.111 T DEEPMD INFO iter.000001/02.fp/data.001 400 1 137 0.129 T DEEPMD INFO iter.000001/02.fp/data.002 400 1 137 0.129 T DEEPMD INFO iter.000001/02.fp/data.003 400 1 138 0.130 T DEEPMD INFO iter.000002/02.fp/data.000 400 1 133 0.126 T DEEPMD INFO iter.000002/02.fp/data.001 400 1 133 0.126 T DEEPMD INFO iter.000002/02.fp/data.002 400 1 134 0.127 T DEEPMD INFO iter.000002/02.fp/data.003 400 1 129 0.122 T DEEPMD INFO -------------------------------------------------------------------------------------- DEEPMD INFO training without frame parameter DEEPMD INFO Changing energy bias in pretrained model for types ['Li', 'Ge', 'P', 'S']... (this step may take long time) WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.config.list_physical_devices('GPU')` instead. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.config.list_physical_devices('GPU')` instead. 
DEEPMD INFO Adjust batch size from 1024 to 2048 DEEPMD INFO Adjust batch size from 2048 to 4096 DEEPMD INFO Adjust batch size from 4096 to 8192 DEEPMD INFO RMSE of atomic energy after linear regression is: 0.0018085361367408837 eV/atom. DEEPMD INFO Change energy bias of ['Li', 'Ge', 'P', 'S'] from [-4.17483491 -0.41748349 -0.83496698 -5.00980189] to [-4.1760068 -0.41760068 -0.83520136 -5.01120816]. DEEPMD INFO built lr DEEPMD INFO built network DEEPMD INFO built training WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. DEEPMD INFO initialize training from the frozen pretrained model DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08 DEEPMD INFO batch 100 training time 8.34 s, testing time 0.05 s DEEPMD INFO batch 200 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 300 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 400 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 500 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 600 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 700 training time 6.22 s, testing time 0.05 s DEEPMD INFO batch 800 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 900 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 1000 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1100 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1200 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1300 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1400 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1500 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 1600 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1700 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1800 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 1900 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 2000 training time 6.20 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 2100 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 2200 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 2300 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2400 training time 6.20 s, testing time 0.05 s DEEPMD INFO batch 2500 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2600 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 2700 training time 6.21 s, testing time 0.05 s DEEPMD INFO batch 2800 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 2900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3000 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3100 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 3200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 3900 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 4000 training time 6.16 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD 
INFO batch 4100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4200 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 4300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4500 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 4600 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 4700 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 4800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 4900 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 5000 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 5100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 5200 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 5300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 5400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 5500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 5600 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 5700 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 5800 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 5900 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6000 training time 6.16 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 6100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6400 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 6700 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6800 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 6900 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7000 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7100 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7200 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7300 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7500 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 7600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 7700 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7800 training time 6.15 s, testing time 0.05 s DEEPMD INFO batch 7900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8000 training time 6.15 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO batch 8100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8300 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8700 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 8800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9000 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 9100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9300 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 9400 training time 
6.18 s, testing time 0.05 s DEEPMD INFO batch 9500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9600 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9800 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 10000 training time 6.18 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO average training time: 0.0617 s/batch (exclude first 100 batches) DEEPMD INFO finished training DEEPMD INFO wall time: 634.868 s 2023-07-22 17:19:41.065653: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-22 17:19:42.102660: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:19:42.102767: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:19:42.102785: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. DEEPMD INFO The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam'] WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. 
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. DEEPMD INFO 1336 ops in the final graph.
That's it: a potential fine-tuned from an existing pre-trained model. To summarize, it differs from ordinary model training in:
- the parameter settings in input.json
- the extra command-line option
--finetune pretrained.pb
3.5 Checking the Models
We have now trained the solid-state electrolyte DPA-1 potential both from scratch and by fine-tuning. Let's check how the models perform using lcurve (the learning-curve output file) and dp test!
📌 Note: to save time, the number of training steps in this tutorial is not enough to reach convergence; in real training scenarios, remember to use an appropriate training length~
# step rmse_val rmse_trn rmse_e_val rmse_e_trn rmse_f_val rmse_f_trn lr 0 2.77e+01 2.71e+01 2.01e+00 2.03e+00 8.58e-01 8.36e-01 1.0e-03 100 1.83e+01 1.62e+01 8.53e-02 5.69e-02 6.07e-01 5.38e-01 9.0e-04 200 9.90e+00 9.35e+00 1.78e-01 1.78e-01 3.42e-01 3.23e-01 8.1e-04 300 7.22e+00 6.22e+00 9.65e-02 1.05e-01 2.64e-01 2.26e-01 7.4e-04 400 6.61e+00 5.67e+00 7.50e-03 1.31e-02 2.57e-01 2.20e-01 6.6e-04 500 5.75e+00 5.45e+00 5.07e-02 4.98e-02 2.34e-01 2.21e-01 6.0e-04 600 5.83e+00 4.90e+00 1.72e-03 3.21e-03 2.51e-01 2.11e-01 5.4e-04 700 3.94e+00 4.28e+00 2.77e-02 3.05e-02 1.77e-01 1.93e-01 4.9e-04 800 4.64e+00 4.28e+00 6.33e-02 6.20e-02 2.16e-01 1.99e-01 4.4e-04 900 4.17e+00 4.59e+00 2.14e-02 1.86e-02 2.09e-01 2.30e-01 4.0e-04 1000 4.24e+00 3.72e+00 2.79e-02 2.48e-02 2.22e-01 1.95e-01 3.6e-04 1100 3.41e+00 3.72e+00 4.79e-02 4.03e-02 1.84e-01 2.03e-01 3.2e-04 1200 3.17e+00 3.45e+00 1.29e-02 1.99e-02 1.85e-01 2.00e-01 2.9e-04 1300 3.46e+00 3.29e+00 8.73e-02 7.90e-02 1.92e-01 1.84e-01 2.6e-04 1400 3.06e+00 3.19e+00 5.29e-02 4.67e-02 1.89e-01 2.00e-01 2.4e-04 1500 2.89e+00 2.60e+00 7.33e-03 9.06e-03 1.96e-01 1.77e-01 2.1e-04 1600 2.75e+00 2.56e+00 5.06e-03 1.38e-04 1.97e-01 1.84e-01 1.9e-04 1700 2.44e+00 2.52e+00 1.66e-02 1.52e-02 1.82e-01 1.89e-01 1.7e-04 1800 1.87e+00 2.47e+00 2.41e-02 7.82e-03 1.44e-01 1.96e-01 1.6e-04 1900 2.24e+00 2.21e+00 1.16e-02 1.34e-02 1.86e-01 1.84e-01 1.4e-04 2000 2.03e+00 1.88e+00 1.75e-02 2.10e-02 1.76e-01 1.62e-01 1.3e-04 2100 1.95e+00 1.91e+00 1.21e-02 1.50e-02 1.79e-01 1.75e-01 1.2e-04 2200 1.89e+00 1.99e+00 1.06e-02 1.47e-02 1.83e-01 1.92e-01 1.0e-04 2300 1.81e+00 1.78e+00 9.27e-03 6.88e-03 1.85e-01 1.82e-01 9.4e-05 2400 1.63e+00 1.65e+00 1.87e-02 1.64e-02 1.72e-01 1.75e-01 8.5e-05 2500 1.76e+00 1.67e+00 1.97e-02 2.27e-02 1.95e-01 1.82e-01 7.7e-05 2600 1.50e+00 1.51e+00 9.35e-03 6.31e-03 1.78e-01 1.79e-01 6.9e-05 2700 1.47e+00 1.22e+00 1.93e-03 4.47e-03 1.84e-01 1.53e-01 6.3e-05 2800 1.44e+00 1.44e+00 1.51e-02 1.35e-02 1.86e-01 1.87e-01 5.7e-05 2900 1.29e+00 1.42e+00 1.09e-02 8.51e-03 1.77e-01 1.96e-01 5.1e-05 3000 1.30e+00 1.28e+00 8.11e-03 7.30e-03 1.88e-01 1.85e-01 4.6e-05 3100 1.07e+00 1.21e+00 3.57e-03 6.98e-03 1.64e-01 1.84e-01 4.2e-05 3200 1.12e+00 1.07e+00 4.16e-03 2.35e-03 1.80e-01 1.73e-01 3.8e-05 3300 1.22e+00 1.05e+00 2.06e-03 3.05e-03 2.07e-01 1.78e-01 3.4e-05 3400 1.04e+00 1.00e+00 1.11e-02 5.72e-03 1.82e-01 1.77e-01 3.1e-05 3500 8.83e-01 9.68e-01 4.07e-03 4.30e-03 1.65e-01 1.80e-01 2.8e-05 3600 9.95e-01 9.55e-01 1.16e-02 7.57e-03 1.90e-01 1.85e-01 2.5e-05 3700 8.91e-01 8.59e-01 9.37e-03 1.26e-02 1.80e-01 1.70e-01 2.2e-05 3800 9.13e-01 8.02e-01 9.55e-04 3.18e-03 1.98e-01 1.73e-01 2.0e-05 3900 8.30e-01 7.84e-01 8.00e-03 4.25e-03 1.85e-01 1.77e-01 1.8e-05 4000 7.63e-01 8.00e-01 1.97e-03 1.83e-03 1.82e-01 1.91e-01 1.7e-05 4100 7.43e-01 7.65e-01 8.91e-03 1.14e-02 1.81e-01 1.83e-01 1.5e-05 4200 7.29e-01 6.09e-01 1.04e-02 7.09e-03 1.84e-01 1.56e-01 1.3e-05 4300 7.21e-01 7.04e-01 1.78e-03 6.26e-03 1.99e-01 1.91e-01 1.2e-05 4400 6.60e-01 6.26e-01 1.30e-03 5.52e-03 1.91e-01 1.78e-01 1.1e-05 4500 6.25e-01 5.83e-01 6.60e-03 3.96e-03 1.85e-01 1.75e-01 9.9e-06 4600 5.86e-01 5.10e-01 3.30e-04 5.71e-03 1.86e-01 1.58e-01 8.9e-06 4700 5.21e-01 5.50e-01 6.89e-03 5.96e-03 1.67e-01 1.78e-01 8.1e-06 4800 5.68e-01 5.41e-01 3.52e-03 1.32e-03 1.96e-01 1.88e-01 7.3e-06 4900 4.90e-01 5.28e-01 4.22e-03 6.94e-03 1.76e-01 1.86e-01 6.6e-06 5000 4.88e-01 4.61e-01 2.72e-03 1.98e-03 1.84e-01 1.75e-01 5.9e-06 5100 4.75e-01 4.26e-01 6.41e-04 2.15e-03 1.88e-01 1.68e-01 5.3e-06 5200 
4.48e-01 4.38e-01 5.80e-03 1.03e-02 1.80e-01 1.61e-01 4.8e-06 5300 4.53e-01 4.28e-01 6.71e-03 1.81e-03 1.87e-01 1.85e-01 4.4e-06 5400 4.24e-01 4.13e-01 2.37e-03 8.79e-04 1.90e-01 1.86e-01 3.9e-06 5500 4.34e-01 4.11e-01 4.81e-04 3.49e-03 2.04e-01 1.90e-01 3.5e-06 5600 3.85e-01 3.41e-01 2.10e-04 4.40e-03 1.88e-01 1.61e-01 3.2e-06 5700 3.46e-01 3.53e-01 3.73e-03 3.43e-03 1.72e-01 1.76e-01 2.9e-06 5800 3.55e-01 3.47e-01 2.80e-03 5.82e-04 1.85e-01 1.83e-01 2.6e-06 5900 3.28e-01 3.29e-01 1.18e-03 1.85e-03 1.79e-01 1.79e-01 2.4e-06 6000 3.09e-01 3.30e-01 3.07e-03 1.18e-03 1.71e-01 1.86e-01 2.1e-06 6100 3.23e-01 3.17e-01 6.21e-03 2.25e-03 1.75e-01 1.84e-01 1.9e-06 6200 3.14e-01 3.30e-01 3.57e-04 3.39e-03 1.90e-01 1.96e-01 1.7e-06 6300 3.07e-01 3.15e-01 9.20e-04 3.24e-03 1.92e-01 1.93e-01 1.6e-06 6400 2.97e-01 2.99e-01 1.34e-03 5.33e-03 1.91e-01 1.80e-01 1.4e-06 6500 2.72e-01 2.96e-01 2.25e-03 6.45e-03 1.78e-01 1.77e-01 1.3e-06 6600 2.73e-01 2.93e-01 1.84e-03 2.25e-03 1.85e-01 1.98e-01 1.1e-06 6700 2.86e-01 2.77e-01 4.48e-03 5.85e-03 1.90e-01 1.76e-01 1.0e-06 6800 2.56e-01 2.75e-01 1.36e-03 2.78e-03 1.83e-01 1.93e-01 9.3e-07 6900 2.60e-01 2.64e-01 2.62e-03 2.45e-03 1.88e-01 1.91e-01 8.4e-07 7000 2.40e-01 2.35e-01 3.99e-03 4.34e-03 1.71e-01 1.64e-01 7.6e-07 7100 2.40e-01 2.54e-01 1.74e-03 3.09e-03 1.83e-01 1.90e-01 6.9e-07 7200 2.18e-01 2.19e-01 1.32e-03 1.63e-03 1.70e-01 1.70e-01 6.2e-07 7300 2.31e-01 2.30e-01 3.02e-03 2.25e-03 1.79e-01 1.80e-01 5.6e-07 7400 2.31e-01 2.41e-01 3.03e-03 1.80e-03 1.82e-01 1.94e-01 5.1e-07 7500 2.24e-01 2.41e-01 1.50e-04 2.81e-03 1.86e-01 1.94e-01 4.6e-07 7600 2.29e-01 2.35e-01 1.44e-03 1.79e-03 1.91e-01 1.96e-01 4.1e-07 7700 2.11e-01 1.94e-01 6.94e-04 2.66e-03 1.80e-01 1.59e-01 3.7e-07 7800 2.09e-01 2.20e-01 2.60e-03 1.23e-03 1.75e-01 1.89e-01 3.4e-07 7900 2.20e-01 1.96e-01 3.74e-03 3.86e-03 1.82e-01 1.58e-01 3.0e-07 8000 2.14e-01 2.81e-01 3.20e-03 1.18e-02 1.81e-01 1.35e-01 2.7e-07 8100 2.14e-01 2.14e-01 3.17e-03 8.66e-05 1.83e-01 1.92e-01 2.5e-07 8200 2.46e-01 2.11e-01 7.13e-03 4.51e-04 1.81e-01 1.91e-01 2.2e-07 8300 2.13e-01 2.05e-01 2.08e-03 2.95e-03 1.90e-01 1.79e-01 2.0e-07 8400 2.06e-01 1.96e-01 3.87e-03 2.31e-04 1.76e-01 1.80e-01 1.8e-07 8500 1.93e-01 2.19e-01 8.70e-04 2.88e-03 1.78e-01 1.96e-01 1.6e-07 8600 2.18e-01 2.14e-01 2.43e-03 8.26e-04 1.98e-01 1.99e-01 1.5e-07 8700 2.12e-01 2.06e-01 2.30e-03 3.80e-03 1.95e-01 1.80e-01 1.3e-07 8800 2.31e-01 2.02e-01 6.19e-03 3.20e-03 1.84e-01 1.81e-01 1.2e-07 8900 1.93e-01 1.79e-01 2.89e-03 4.03e-04 1.75e-01 1.70e-01 1.1e-07 9000 1.82e-01 1.93e-01 9.33e-04 1.00e-03 1.73e-01 1.83e-01 9.8e-08 9100 1.83e-01 2.22e-01 1.07e-03 2.42e-03 1.74e-01 2.07e-01 8.8e-08 9200 1.93e-01 1.95e-01 3.45e-03 2.74e-03 1.73e-01 1.80e-01 8.0e-08 9300 2.07e-01 1.89e-01 4.09e-03 2.20e-05 1.83e-01 1.83e-01 7.2e-08 9400 2.10e-01 2.30e-01 1.47e-04 3.88e-03 2.03e-01 2.10e-01 6.5e-08 9500 2.28e-01 1.81e-01 6.78e-03 1.27e-03 1.78e-01 1.74e-01 5.9e-08 9600 2.00e-01 1.93e-01 1.75e-03 4.21e-03 1.92e-01 1.69e-01 5.3e-08 9700 1.90e-01 1.69e-01 2.14e-03 2.24e-03 1.81e-01 1.59e-01 4.8e-08 9800 1.85e-01 2.30e-01 1.29e-04 5.23e-03 1.81e-01 2.01e-01 4.3e-08 9900 1.88e-01 1.97e-01 3.34e-03 2.92e-03 1.73e-01 1.85e-01 3.9e-08 10000 1.79e-01 2.07e-01 1.66e-03 3.54e-03 1.73e-01 1.91e-01 3.5e-08
Want a direct visual comparison of the two models' learning curves? Run the ready-made visualization script with one click~
['dpa/lcurve.out', 'dpa_finetune/lcurve.out']
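The visualization script itself is not shown in this export; below is a minimal sketch of such a comparison (the column names follow the lcurve.out header shown above, the paths follow the printed list):

import numpy as np
import matplotlib.pyplot as plt

for path in ["dpa/lcurve.out", "dpa_finetune/lcurve.out"]:
    data = np.genfromtxt(path, names=True)     # parses the commented header line
    plt.semilogy(data["step"], data["rmse_e_val"], label=path + " energy")
    plt.semilogy(data["step"], data["rmse_f_val"], label=path + " force")
plt.xlabel("training step")
plt.ylabel("validation RMSE")
plt.legend()
plt.show()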
Comparing the two learning curves, we can see that the fine-tuned model starts training with lower energy and force losses than the model trained from scratch.
Let's compute the correlation between the predicted and reference data and visualize it.
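The testing cell presumably calls DeePMD-kit's `dp test` on each frozen model (its per-system output follows). For a correlation check you can also evaluate a model directly from Python; here is a sketch under the directory layout above, where the exact data paths and the choice of validation system are assumptions:

import dpdata
import numpy as np
from deepmd.infer import DeepPot

# load one labeled validation system and the frozen model (paths assumed)
system = dpdata.LabeledSystem("dpa/iter.000001/02.fp/data.000", fmt="deepmd/npy")
model = DeepPot("dpa/dpa.pb")

nf = system.get_nframes()
e_pred, f_pred, v_pred = model.eval(
    system["coords"].reshape(nf, -1),   # (nframes, natoms*3)
    system["cells"].reshape(nf, -1),    # (nframes, 9)
    system["atom_types"].tolist(),
)
e_dft = system["energies"]
print("energy RMSE/atom:",
      np.sqrt(np.mean((e_pred.ravel() - e_dft) ** 2)) / system.get_natoms())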
2023-07-22 17:21:20.104037: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-22 17:21:21.129529: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:21:21.129622: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:21:21.129639: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.config.list_physical_devices('GPU')` instead. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.config.list_physical_devices('GPU')` instead. 
DEEPMD INFO # ---------------output of dp test--------------- DEEPMD INFO # testing system : iter.000001/02.fp/data.000 DEEPMD INFO Adjust batch size from 1024 to 2048 DEEPMD INFO Adjust batch size from 2048 to 4096 DEEPMD INFO Adjust batch size from 4096 to 8192 DEEPMD INFO Adjust batch size from 8192 to 16384 DEEPMD INFO Adjust batch size from 16384 to 32768 DEEPMD INFO # number of test data : 100 DEEPMD INFO Energy MAE : 9.427785e-01 eV DEEPMD INFO Energy RMSE : 1.196250e+00 eV DEEPMD INFO Energy MAE/Natoms : 2.356946e-03 eV DEEPMD INFO Energy RMSE/Natoms : 2.990626e-03 eV DEEPMD INFO Force MAE : 1.336536e-01 eV/A DEEPMD INFO Force RMSE : 1.815210e-01 eV/A DEEPMD INFO Virial MAE : 9.645088e+00 eV DEEPMD INFO Virial RMSE : 1.431580e+01 eV DEEPMD INFO Virial MAE/Natoms : 2.411272e-02 eV DEEPMD INFO Virial RMSE/Natoms : 3.578950e-02 eV DEEPMD INFO # ----------------------------------------------- DEEPMD INFO # ---------------output of dp test--------------- DEEPMD INFO # testing system : iter.000001/02.fp/data.001 2023-07-22 17:21:42.161708: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize/Square If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. Current allocation summary follows. Current allocation summary follows. 2023-07-22 17:21:42.161959: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_* 2023-07-22 17:21:42.163651: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory 2023-07-22 17:21:52.165476: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize_1/Square If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. Current allocation summary follows. Current allocation summary follows. 2023-07-22 17:21:52.165726: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_* 2023-07-22 17:21:52.165750: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory 2023-07-22 17:22:02.165943: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize_2/Square If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. Current allocation summary follows. Current allocation summary follows. 
2023-07-22 17:22:02.166201: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_* 2023-07-22 17:22:02.166226: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory DEEPMD INFO Adjust batch size from 32768 to 16384 DEEPMD INFO # number of test data : 100 DEEPMD INFO Energy MAE : 8.434060e-01 eV DEEPMD INFO Energy RMSE : 1.056923e+00 eV DEEPMD INFO Energy MAE/Natoms : 2.108515e-03 eV DEEPMD INFO Energy RMSE/Natoms : 2.642309e-03 eV DEEPMD INFO Force MAE : 1.342740e-01 eV/A DEEPMD INFO Force RMSE : 1.823510e-01 eV/A DEEPMD INFO Virial MAE : 9.380907e+00 eV DEEPMD INFO Virial RMSE : 1.377223e+01 eV DEEPMD INFO Virial MAE/Natoms : 2.345227e-02 eV DEEPMD INFO Virial RMSE/Natoms : 3.443056e-02 eV DEEPMD INFO # ----------------------------------------------- DEEPMD INFO # ---------------output of dp test--------------- DEEPMD INFO # testing system : iter.000001/02.fp/data.002 DEEPMD INFO # number of test data : 100 DEEPMD INFO Energy MAE : 1.096916e+00 eV DEEPMD INFO Energy RMSE : 1.411227e+00 eV DEEPMD INFO Energy MAE/Natoms : 2.742291e-03 eV DEEPMD INFO Energy RMSE/Natoms : 3.528067e-03 eV DEEPMD INFO Force MAE : 1.328904e-01 eV/A DEEPMD INFO Force RMSE : 1.803824e-01 eV/A DEEPMD INFO Virial MAE : 8.762420e+00 eV DEEPMD INFO Virial RMSE : 1.282206e+01 eV DEEPMD INFO Virial MAE/Natoms : 2.190605e-02 eV DEEPMD INFO Virial RMSE/Natoms : 3.205516e-02 eV DEEPMD INFO # ----------------------------------------------- DEEPMD INFO # ---------------output of dp test--------------- DEEPMD INFO # testing system : iter.000001/02.fp/data.003 DEEPMD INFO # number of test data : 100 DEEPMD INFO Energy MAE : 1.005292e+00 eV DEEPMD INFO Energy RMSE : 1.240209e+00 eV DEEPMD INFO Energy MAE/Natoms : 2.513231e-03 eV DEEPMD INFO Energy RMSE/Natoms : 3.100522e-03 eV DEEPMD INFO Force MAE : 1.341954e-01 eV/A DEEPMD INFO Force RMSE : 1.820485e-01 eV/A DEEPMD INFO Virial MAE : 8.224030e+00 eV DEEPMD INFO Virial RMSE : 1.197917e+01 eV DEEPMD INFO Virial MAE/Natoms : 2.056007e-02 eV DEEPMD INFO Virial RMSE/Natoms : 2.994793e-02 eV DEEPMD INFO # ----------------------------------------------- DEEPMD INFO # ----------weighted average of errors----------- DEEPMD INFO # number of systems : 4 DEEPMD INFO Energy MAE : 9.720983e-01 eV DEEPMD INFO Energy RMSE : 1.232658e+00 eV DEEPMD INFO Energy MAE/Natoms : 2.430246e-03 eV DEEPMD INFO Energy RMSE/Natoms : 3.081644e-03 eV DEEPMD INFO Force MAE : 1.337533e-01 eV/A DEEPMD INFO Force RMSE : 1.815773e-01 eV/A DEEPMD INFO Virial MAE : 9.003111e+00 eV DEEPMD INFO Virial RMSE : 1.325257e+01 eV DEEPMD INFO Virial MAE/Natoms : 2.250778e-02 eV DEEPMD INFO Virial RMSE/Natoms : 3.313142e-02 eV DEEPMD INFO # ----------------------------------------------- 2023-07-22 17:22:21.320361: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 
The occasional `RESOURCE_EXHAUSTED` warnings are harmless here: `dp test` catches the out-of-memory condition and automatically scales the batch size back down (from 32768 to 16384 in this run) before continuing. Next, the same test is run on the model finetuned from the pretrained DPA-1 model:
```
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : iter.000001/02.fp/data.000
DEEPMD INFO    Adjust batch size from 1024 to 2048
DEEPMD INFO    Adjust batch size from 2048 to 4096
DEEPMD INFO    Adjust batch size from 4096 to 8192
DEEPMD INFO    Adjust batch size from 8192 to 16384
DEEPMD INFO    Adjust batch size from 16384 to 32768
DEEPMD INFO    # number of test data : 100
DEEPMD INFO    Energy MAE         : 8.621734e-01 eV
DEEPMD INFO    Energy RMSE        : 1.057169e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.155433e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.642923e-03 eV
DEEPMD INFO    Force  MAE         : 1.101241e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.489321e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.266581e+01 eV
DEEPMD INFO    Virial RMSE        : 1.956447e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 3.166453e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.891117e-02 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : iter.000001/02.fp/data.001
2023-07-22 17:22:43.410028: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000) requested by op load/attention_layer_1/c_value/MatMul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. Current allocation summary follows.
2023-07-22 17:22:43.410369: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at matmul_op_impl.h:731 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1944000,128] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
DEEPMD INFO    Adjust batch size from 32768 to 16384
DEEPMD INFO    # number of test data : 100
DEEPMD INFO    Energy MAE         : 6.472647e-01 eV
DEEPMD INFO    Energy RMSE        : 8.226297e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.618162e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.056574e-03 eV
DEEPMD INFO    Force  MAE         : 1.106034e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.492427e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.168436e+01 eV
DEEPMD INFO    Virial RMSE        : 1.808673e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.921089e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.521683e-02 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : iter.000001/02.fp/data.002
DEEPMD INFO    # number of test data : 100
DEEPMD INFO    Energy MAE         : 6.905101e-01 eV
DEEPMD INFO    Energy RMSE        : 8.749119e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.726275e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.187280e-03 eV
DEEPMD INFO    Force  MAE         : 1.094928e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.477969e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.099581e+01 eV
DEEPMD INFO    Virial RMSE        : 1.688399e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.748951e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.220997e-02 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : iter.000001/02.fp/data.003
DEEPMD INFO    # number of test data : 100
DEEPMD INFO    Energy MAE         : 6.999822e-01 eV
DEEPMD INFO    Energy RMSE        : 8.619741e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.749956e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.154935e-03 eV
DEEPMD INFO    Force  MAE         : 1.101929e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.488093e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.067116e+01 eV
DEEPMD INFO    Virial RMSE        : 1.624072e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.667789e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.060179e-02 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ----------weighted average of errors-----------
DEEPMD INFO    # number of systems : 4
DEEPMD INFO    Energy MAE         : 7.249826e-01 eV
DEEPMD INFO    Energy RMSE        : 9.086799e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.812456e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.271700e-03 eV
DEEPMD INFO    Force  MAE         : 1.101033e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.486962e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.150428e+01 eV
DEEPMD INFO    Virial RMSE        : 1.773928e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.876071e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.434820e-02 eV
DEEPMD INFO    # -----------------------------------------------
```
As before, we can also check model accuracy visually, by plotting each model's predicted per-atom energies against the DFT references, as sketched below.
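The notebook's own plotting helper is not reproduced here; the following is a minimal parity-plot sketch under the same assumptions. It reads the two-column `dp test` detail files (first column DFT reference, second column model prediction) and writes the PNG files referenced in the output that follows. The function name `parity_plot` is ours for illustration, not part of DeePMD-kit:

```python
# Sketch for illustration; the original notebook script may differ.
import numpy as np
import matplotlib.pyplot as plt

def parity_plot(detail_file, png_file):
    # Each detail file holds two columns: DFT reference, model prediction.
    data = np.loadtxt(detail_file)
    ref, pred = data[:, 0], data[:, 1]
    rmse = np.sqrt(np.mean((pred - ref) ** 2))
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(ref, pred, s=5, alpha=0.5)
    lims = [min(ref.min(), pred.min()), max(ref.max(), pred.max())]
    ax.plot(lims, lims, "k--", lw=1)  # y = x: perfect agreement
    ax.set_xlabel("DFT energy per atom (eV)")
    ax.set_ylabel("Predicted energy per atom (eV)")
    ax.set_title(f"RMSE = {rmse:.3e} eV/atom")
    fig.savefig(png_file, dpi=150, bbox_inches="tight")
    plt.close(fig)
    print(f"Processing file: {detail_file}  ->  {png_file}")

for tag in ("dpa", "finetune"):
    parity_plot(f"result_{tag}.e_peratom.out", f"./dp_test_{tag}.png")
```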
```
Processing file: result_dpa.e_peratom.out  ->  ./dp_test_dpa.png
Processing file: result_finetune.e_peratom.out  ->  ./dp_test_finetune.png
```
Comparing the weighted averages, the finetuned model reduces the Energy RMSE/Natoms from 3.08e-03 eV to 2.27e-03 eV and the Force RMSE from 1.82e-01 eV/Å to 1.49e-01 eV/Å relative to the model trained from scratch: under identical training conditions, finetuning from the pretrained model yields the more accurate potential. (Its virial errors are somewhat larger, e.g. Virial RMSE/Natoms of 4.43e-02 eV vs. 3.31e-02 eV, which is worth keeping in mind for pressure-sensitive applications.)
That brings this guide to a close. So, itching to put the DPA-1 pretrained model to work? Go apply what you have learned to your own research topic~
That day, your advisor is full of praise for your model-training strategy 👍.

Then another question occurs to you: now that the model is trained, how do I actually use it to run molecular dynamics simulations?
ddd! Welcome to the next guide:

- Deep Potential Molecular Dynamics Guide | Solid-State Electrolytes in Practice: Property Prediction — 《Hands-on Solid-State Electrolyte Research with Deep Potential Molecular Dynamics》