

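ddd! A Guide to Meeting DPA-1 | Solid-State Electrolytes in Practice: Model Training
©️ Copyright 2023 @ Authors
Author: 宋哲轩 (Song Zhexuan) 📨
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the blue 开始连接 (Start Connection) button at the top of the page, select the `deepmd-kit:2.2.1-cuda11.6` image and the `c12_m92_1 * NVIDIA V100` node configuration, and the notebook will be ready to run after a short wait.
*This tutorial assumes basic familiarity with DeePMD-kit. If you are new to it, read "Quick Start with DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane" (《快速开始 DeePMD-kit|训练甲烷深度势能分子动力学模型》) first.
🎯 Welcome to "ddd! A Guide to Meeting DPA-1 | Solid-State Electrolytes in Practice: Model Training"
Drawing on the original DPA-1 paper (arXiv:2208.08236), this guide introduces the research background and basic principles of the DPA-1 model and provides practical code examples that explain the key parameters. Taking a solid-state electrolyte as the example, it then walks you step by step through training a DPA-1 potential model on the training set from J. Chem. Phys. 154, 094703 (2021).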
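Student A: rule out conventional classical molecular dynamics (CMD) simulation, which is not accurate enough 🙅♀️
Student B: rule out ab initio molecular dynamics (AIMD), which is highly accurate but cannot handle large systems or long time scales 🙅♀️
You: naturally, go with machine learning molecular dynamics (MLMD) simulation, the hottest approach right now, which combines efficiency and accuracy 🙌
- Some general-purpose models have limited application scenarios and cover only a small region of chemical space
- A model can be retrained on a dataset of diverse configurations obtained with tools such as dpgen, but this is costly
📣 Recently, Duo Zhang (张铎), Hangrui Bi (毕航睿) and other researchers from DP Technology and the AI for Science Institute, Beijing, together with collaborators, posted a preprint on arXiv titled "DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation".
With a better encoding of element types and a key attention mechanism, DPA-1 greatly increases the capacity and transferability of earlier Deep Potential models, yielding a large pretrained model that covers most common elements of the periodic table. Transfer-learning results on different datasets show that the model substantially reduces the amount of data needed for new scenarios. More details are available in the WeChat post and the original paper. The training and molecular dynamics simulation features of DPA-1 are open-sourced in the DeePMD-kit project of the DeepModeling community. The work was carried out on Bohrium, DP Technology's scientific computing cloud platform.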
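1. Learning objectives
- Basic principles and applications of DPA-1;
- Hands-on training of a DPA-1 potential model, using a solid-state electrolyte as the example: reading the input script; training from scratch vs. fine-tuning an existing pretrained model; model evaluation and testing
2. Introduction to DPA-1
👂 Eager to get hands-on? Jump straight to Section 3.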
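2.1 Research background
Potential-function development has always pursued a balance between accuracy and efficiency. Classical force fields are fast and convenient, but their accuracy is hard to push further. AIMD (ab initio molecular dynamics), which has attracted much attention, improves accuracy considerably, but its computational cost makes it impractical for large systems and long time scales. With the rise of AI for Science, machine learning makes it possible to train potentials that are both accurate and efficient (Figure 1. Comparison of molecular dynamics simulation approaches). In the new MLMD paradigm, quantum-mechanical (QM) calculations are no longer used directly in AIMD; instead, they prepare the dataset from which the machine learning potential (MLP) is generated. AIMD results can of course also serve as the initial dataset.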
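2.2 Methods
The DPA-1 model is a comprehensive upgrade of the DP model series. It uses a key gated attention mechanism (Gated Attention Mechanism) to model interatomic interactions more fully. Trained on existing data, it can learn more of the hidden atomic interaction information, which greatly improves its transferability across datasets with different configurations and compositions and thereby also improves sampling efficiency during data generation. By encoding element information, the model also extends its capacity for elements. The developers pretrained the model on a fairly large dataset containing 56 elements and transferred the pretrained model to a variety of downstream tasks. The experiments show that the pretrained model substantially reduces the data and training cost required for downstream tasks and improves prediction accuracy, which could have a far-reaching impact on molecular simulation (Figure 2. Schematic of the DPA-1 model architecture).
- Descriptor: [element type encoding] the atom type is added as an input with its own embedding matrix; [attention mechanism] is introduced to reweight the interactions between atoms according to their distances and angles
- Adjusted loss calculation: to fine-tune the pretrained model on a new dataset, first change the energy bias of the pretrained model using the statistics of the new dataset, then freeze part of the pretrained model's parameters and train the remaining ones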
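The energy-bias step can be pictured as a least-squares fit of one energy offset per element type. The sketch below is a simplified, hypothetical illustration and not the DeePMD-kit implementation: the function names `solve_linear` and `fit_energy_bias` and the toy data are assumptions made for this example. It finds per-type biases b such that the sum over types of (atom count × bias) best matches each frame's total energy.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_energy_bias(counts, energies):
    """Least-squares per-type bias from atom counts and total energies.

    counts[k][t] is the number of atoms of type t in frame k,
    energies[k] is the total energy of frame k.
    """
    ntypes = len(counts[0])
    # Normal equations: (C^T C) b = C^T E
    ctc = [[sum(row[i] * row[j] for row in counts) for j in range(ntypes)]
           for i in range(ntypes)]
    cte = [sum(row[i] * e for row, e in zip(counts, energies))
           for i in range(ntypes)]
    return solve_linear(ctc, cte)


# Toy data (hypothetical): two element types, three frames with different
# compositions, generated from biases of -4.0 and -5.0 per atom.
toy_counts = [[2, 1], [1, 2], [3, 3]]
toy_energies = [-13.0, -14.0, -27.0]
toy_bias = fit_energy_bias(toy_counts, toy_energies)
print(toy_bias)
```

If every frame has the same composition, the normal equations become singular and a minimum-norm or regularised solver is needed instead; this sketch assumes varied compositions.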
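2.3 Experimental validation
- Ternary alloy dataset
- Solid-state electrolyte (SSE) dataset
- High-entropy alloy (HEA) dataset
Note: OC20 dataset: composed of single adsorbates (small molecules) physically bound to catalyst surfaces, where the catalysts are periodic bulk materials covering 56 elements
(Figure 3. Learning curves of energy and force for DeepPot-SE and DPA-1 under different settings and on different systems).
Sample efficiency test: case scenarios (Figure 4. Sample efficiency of the models)
2.4 Model interpretability
2.5 Outlook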
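3. Solid-state electrolyte in practice: training a DPA-1 potential
Now that we have covered the theory, let's get hands-on! In this section we use the solid-state electrolyte dataset LiGePS-SSE-PBE as an example to train DPA-1 both from scratch and by fine-tuning.
3.1 Downloading the dataset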
fatal: destination path 'study_examples' already exists and is not an empty directory.
.
├── DeePMD-SSE.ipynb
├── dpa
│   ├── checkpoint
│   ├── dpa.pb
│   ├── input.json
│   ├── iter.000000
│   │   └── 02.fp
│   ├── iter.000001
│   │   └── 02.fp
│   └── iter.000002
│       └── 02.fp
├── dpa-sse.ipynb
└── dpa_finetune
    ├── input.json
    ├── iter.000000
    │   └── 02.fp
    ├── iter.000001
    │   └── 02.fp
    └── iter.000002
        └── 02.fp

14 directories, 6 files
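Before writing the input script, it can help to list the data folders that actually exist under each iteration. The helper below is a small sketch: `list_systems` is a hypothetical name, and the `data.*` folder pattern under `02.fp` is an assumption taken from the system paths used in the input script of Section 3.2.

```python
from pathlib import Path


def list_systems(root, iteration):
    """Return sorted paths (relative to root) of data.* folders in <iteration>/02.fp."""
    base = Path(root) / iteration / "02.fp"
    if not base.is_dir():
        # Missing iteration folder: nothing to list.
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in base.glob("data.*")
        if p.is_dir()
    )


def build_system_lists(root, train_iters, valid_iters):
    """Collect training and validation systems from the given iteration folders."""
    train = [s for it in train_iters for s in list_systems(root, it)]
    valid = [s for it in valid_iters for s in list_systems(root, it)]
    return train, valid
```

For example, `build_system_lists("dpa", ["iter.000000"], ["iter.000001", "iter.000002"])` would reproduce the training/validation split used in this tutorial, provided the folders follow the assumed layout.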
3.2 Preparing the input script (training from scratch)
{
  "model": {
    "descriptor": {
      "type": "se_atten",
      "sel": 60,
      "rcut_smth": 0.5,
      "rcut": 6.0,
      "neuron": [25, 50, 100],
      "resnet_dt": false,
      "axis_neuron": 16,
      "attn": 128,
      "attn_layer": 1,
      "attn_dotr": true,
      "attn_mask": false,
      "seed": 1801819940,
      "_activation_function": "tanh"
    },
    "fitting_net": {
      "neuron": [240, 240, 240],
      "resnet_dt": true,
      "_coord_norm": true,
      "_type_fitting_net": false,
      "seed": 2375417769,
      "_activation_function": "tanh"
    },
    "type_map": ["Li", "Ge", "P", "S"]
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "decay_steps": 50,
    "stop_lr": 3.51e-08
  },
  "loss": {
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0,
    "limit_pref_v": 0
  },
  "training": {
    "training_data": {
      "systems": [
        "iter.000000/02.fp/data.000",
        "iter.000000/02.fp/data.001",
        "iter.000000/02.fp/data.002",
        "iter.000000/02.fp/data.003"
      ],
      "batch_size": 1
    },
    "validation_data": {
      "systems": [
        "iter.000001/02.fp/data.000",
        "iter.000001/02.fp/data.001",
        "iter.000001/02.fp/data.002",
        "iter.000001/02.fp/data.003",
        "iter.000002/02.fp/data.000",
        "iter.000002/02.fp/data.001",
        "iter.000002/02.fp/data.002",
        "iter.000002/02.fp/data.003"
      ],
      "batch_size": 1
    },
    "numb_steps": 10000,
    "seed": 3982377700,
    "_comment": "that's all",
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "numb_test": 1,
    "save_freq": 2000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json"
  }
}
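To get a feel for the `learning_rate` and `loss` sections of this script, the sketch below evaluates the exponential learning-rate schedule and the way the loss prefactors follow it. The formulas are written as I understand them from the DeePMD-kit documentation and should be treated as an assumption, not as the library's code; the function names are hypothetical. The resulting decay rate can be compared with the `decay_rate 0.950006` reported in the training log of Section 3.3.

```python
import math

# Values copied from the input script above.
START_LR = 1e-3
STOP_LR = 3.51e-8
DECAY_STEPS = 50
NUMB_STEPS = 10000


def decay_rate(start_lr=START_LR, stop_lr=STOP_LR,
               decay_steps=DECAY_STEPS, numb_steps=NUMB_STEPS):
    """Per-interval decay factor chosen so that lr reaches stop_lr at numb_steps."""
    return math.exp(math.log(stop_lr / start_lr) / (numb_steps / decay_steps))


def learning_rate(step):
    """Staircase exponential decay: lr drops once every DECAY_STEPS steps."""
    return START_LR * decay_rate() ** (step // DECAY_STEPS)


def prefactor(step, start_pref, limit_pref):
    """Loss prefactor interpolated between start_pref and limit_pref via lr."""
    ratio = learning_rate(step) / START_LR
    return limit_pref + (start_pref - limit_pref) * ratio


if __name__ == "__main__":
    print(f"decay_rate = {decay_rate():.6f}")
    for step in (0, 2000, 10000):
        print(step,
              f"lr={learning_rate(step):.3e}",
              f"pref_e={prefactor(step, 0.02, 1):.4f}",
              f"pref_f={prefactor(step, 1000, 1):.4f}")
```

Under these formulas the force term dominates early in training (`start_pref_f` = 1000 against `start_pref_e` = 0.02), and both prefactors approach 1 as the learning rate decays.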
"descriptor": {
"type": "se_atten",
"rcut_smth": 0.5,
"rcut": 6.0,
"sel": 60,
"neuron": [25,50,100],
"axis_neuron": 16,
"resnet_dt": false,
"attn": 128,
"attn_layer": 2,
"attn_mask": false,
"attn_dotr": true,
"seed": 1801819940,
"_activation_function": "tanh"
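Compared with the commonly used se_e2_a descriptor, the following parameters deserve attention:
- axis_neuron: the size of the submatrix of the embedding matrix, i.e. the axis matrix in the DeepPot-SE paper
- resnet_dt: whether a timestep is used in the ResNet skip connection (false in this example)
- attn_dotr: whether to multiply the attention weights by the dot product of the relative coordinates, which acts like a gated attention mechanism (Gated Attention Mechanism)
The remaining parameters have the same meaning as in the familiar "se_e2_a" descriptor; see the DeePMD-kit documentation for a more detailed explanation.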
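To make the effect of `attn_dotr` more concrete, here is a deliberately stripped-down sketch of attention over the neighbours of one atom. It is an illustration under strong assumptions, not the DeePMD-kit code: the learned query/key/value projections, multiple layers and normalisation are all omitted, and the function name `gated_attention` is hypothetical. The neighbour features themselves serve as queries, keys and values, and the gate is the dot product of the unit direction vectors of two neighbours.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def gated_attention(features, directions, attn_dotr=True):
    """Scaled dot-product attention over neighbours, optionally gated.

    features[i]   : feature vector of neighbour i (length d)
    directions[i] : unit vector pointing from the central atom to neighbour i
    """
    d = len(features[0])
    out = []
    for i, query in enumerate(features):
        # Attention weights of neighbour i over all neighbours j.
        weights = softmax([dot(query, key) / math.sqrt(d) for key in features])
        if attn_dotr:
            # Gate: weight each pair by the cosine of the angle between them.
            weights = [w * dot(directions[i], directions[j])
                       for j, w in enumerate(weights)]
        out.append([sum(w * v[c] for w, v in zip(weights, features))
                    for c in range(d)])
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two perpendicular neighbours
    print(gated_attention(feats, dirs, attn_dotr=False))
    print(gated_attention(feats, dirs, attn_dotr=True))
```

With perpendicular neighbours the gate removes the cross terms entirely, which shows how angular information enters the reweighting.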
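In addition, DPA-1 enables element type encoding (type embedding) by default to encode element-related information and extend the model's capacity for element types. The default parameters are: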
"neuron": [2, 4, 8],
"resnet_dt": false,
"seed": 1
"type_map": [
3.3 Model training (from scratch)
2023-07-22 16:55:05.785055: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 16:55:06.825439: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 16:55:06.825532: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 16:55:06.825552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO training data with min nbor dist: 1.7120899465949608
DEEPMD INFO training data with max nbor size: [56]
DEEPMD INFO  _____               _____   __  __  _____           _     _  _
DEEPMD INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO installed to: /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install
DEEPMD INFO source : v2.2.0.b0-77-gc1299196
DEEPMD INFO source brach: devel
DEEPMD INFO source commit: c1299196
DEEPMD INFO source commit at: 2023-02-28 09:06:04 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build variant: cpu
DEEPMD INFO build with tf inc: /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: bohrium-13387-1030178
DEEPMD INFO computing device: gpu:0
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO Count of visible GPU: 1
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD INFO found 4 system(s):
DEEPMD INFO                     system  natoms  bch_sz  n_bch   prob  pbc
DEEPMD INFO iter.000000/02.fp/data.000     400       1    127  0.241    T
DEEPMD INFO iter.000000/02.fp/data.001     400       1    131  0.248    T
DEEPMD INFO iter.000000/02.fp/data.002     400       1    133  0.252    T
DEEPMD INFO iter.000000/02.fp/data.003     400       1    137  0.259    T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD INFO found 8 system(s):
DEEPMD INFO                     system  natoms  bch_sz  n_bch   prob  pbc
DEEPMD INFO iter.000001/02.fp/data.000     400       1    117  0.111    T
DEEPMD INFO iter.000001/02.fp/data.001     400       1    137  0.129    T
DEEPMD INFO iter.000001/02.fp/data.002     400       1    137  0.129    T
DEEPMD INFO iter.000001/02.fp/data.003     400       1    138  0.130    T
DEEPMD INFO iter.000002/02.fp/data.000     400       1    133  0.126    T
DEEPMD INFO iter.000002/02.fp/data.001     400       1    133  0.126    T
DEEPMD INFO iter.000002/02.fp/data.002     400       1    134  0.127    T
DEEPMD INFO iter.000002/02.fp/data.003     400       1    129  0.122    T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO training without frame parameter
DEEPMD INFO data stating... (this step may take long time)
DEEPMD INFO built lr
DEEPMD INFO built network
DEEPMD INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO initialize model from scratch
DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO batch 100 training time 8.37 s, testing time 0.05 s
DEEPMD INFO batch 200 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 300 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 500 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 700 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 800 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1000 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 1100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1200 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 1300 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1500 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1600 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1700 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 1800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 1900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 2000 training time 6.19 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 2100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2200 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2300 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2400 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 2600 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 2700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 2800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 2900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3100 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 3200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3800 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 3900 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 4000 training time 6.17 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 4100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4500 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 4600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4800 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 4900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 5000 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 5100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 5300 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 5400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 5500 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 5600 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 5700 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 5800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 5900 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 6000 training time 6.18 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 6100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 6200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6400 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7000 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7200 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 7300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7700 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 7800 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 7900 training time 6.23 s, testing time 0.05 s
DEEPMD INFO batch 8000 training time 6.24 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 8100 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 8200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 8300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 8400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 8500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 8600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 8700 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 8800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 8900 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 9000 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 9100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 9200 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 9300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 9400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 9500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 9600 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 9700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 9800 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 9900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 10000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO average training time: 0.0618 s/batch (exclude first 100 batches)
DEEPMD INFO finished training
DEEPMD INFO wall time: 635.323 s
2023-07-22 17:05:57.559565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:05:58.573920: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:05:58.574019: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:05:58.574036: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam']
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
DEEPMD INFO 1334 ops in the final graph.
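The timing lines in this log can be checked with a few lines of Python. The sketch below is an illustration only: `average_batch_time` is a hypothetical helper, and it assumes that each `training time` value covers the `disp_freq` (here 100) batches since the previous report, which is consistent with the `average training time: 0.0618 s/batch` line above. The sample text contains the first three batch lines copied from the log.

```python
import re

# Matches e.g. "batch 200 training time 6.22 s, testing time 0.05 s"
BATCH_RE = re.compile(
    r"batch\s+(\d+)\s+training time\s+([\d.]+) s, testing time\s+([\d.]+) s"
)


def average_batch_time(log_text, disp_freq=100, skip_batches=100):
    """Average training time per batch, ignoring the first skip_batches batches."""
    total_seconds = 0.0
    n_batches = 0
    for match in BATCH_RE.finditer(log_text):
        step = int(match.group(1))
        train_seconds = float(match.group(2))
        if step <= skip_batches:
            # The first interval includes start-up overhead, so it is skipped.
            continue
        total_seconds += train_seconds
        n_batches += disp_freq
    return total_seconds / n_batches if n_batches else float("nan")


SAMPLE_LOG = """
DEEPMD INFO batch 100 training time 8.37 s, testing time 0.05 s
DEEPMD INFO batch 200 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 300 training time 6.21 s, testing time 0.05 s
"""

if __name__ == "__main__":
    print(f"{average_batch_time(SAMPLE_LOG):.4f} s/batch")
```

Because the regular expression uses `finditer`, the helper also works on log text whose line breaks were lost.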
3.4 Fine-tuning the model
{
  "model": {
    "type_embedding": {"trainable": true},
    "descriptor": {"trainable": true},
    "fitting_net": {"trainable": true},
    "type_map": ["Li", "Ge", "P", "S"]
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "decay_steps": 50,
    "stop_lr": 3.51e-08
  },
  "loss": {
    "type": "ener",
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0,
    "limit_pref_v": 0
  },
  "training": {
    "training_data": {
      "systems": [
        "iter.000000/02.fp/data.000",
        "iter.000000/02.fp/data.001",
        "iter.000000/02.fp/data.002",
        "iter.000000/02.fp/data.003"
      ],
      "batch_size": 1
    },
    "validation_data": {
      "systems": [
        "iter.000001/02.fp/data.000",
        "iter.000001/02.fp/data.001",
        "iter.000001/02.fp/data.002",
        "iter.000001/02.fp/data.003",
        "iter.000002/02.fp/data.000",
        "iter.000002/02.fp/data.001",
        "iter.000002/02.fp/data.002",
        "iter.000002/02.fp/data.003"
      ],
      "batch_size": 1
    },
    "numb_steps": 10000,
    "seed": 3982377700,
    "_comment": "that's all",
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "numb_test": 1,
    "save_freq": 2000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json"
  }
}
"model": {
"type_embedding":{"trainable": true},
"descriptor": {"trainable": true},
"fitting_net": {"trainable": true},
The fine-tuning command adds `--finetune dpa.pb`.
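The two changes for fine-tuning, a reduced `model` section and the extra `--finetune` flag, can be sketched in Python as follows. The helper names `make_finetune_input`, `finetune_command` and `run_finetune` are hypothetical, and the file names `input.json` and `dpa.pb` are taken from the directory listing in Section 3.1; treat this as one possible way to script the step, not as the tutorial's original cell.

```python
import json
import shutil
import subprocess


def make_finetune_input(scratch_cfg):
    """Derive a fine-tuning config from a from-scratch config.

    Only `trainable` flags and `type_map` remain in the model section; the
    network architecture is then taken from the pretrained model.
    """
    cfg = json.loads(json.dumps(scratch_cfg))  # cheap deep copy of JSON data
    cfg["model"] = {
        "type_embedding": {"trainable": True},
        "descriptor": {"trainable": True},
        "fitting_net": {"trainable": True},
        "type_map": scratch_cfg["model"]["type_map"],
    }
    return cfg


def finetune_command(input_json="input.json", pretrained="dpa.pb"):
    """Build the argument list for fine-tuning from a frozen pretrained model."""
    return ["dp", "train", input_json, "--finetune", pretrained]


def run_finetune(input_json="input.json", pretrained="dpa.pb"):
    """Run the fine-tuning command if the `dp` executable is available."""
    if shutil.which("dp") is None:
        raise RuntimeError("DeePMD-kit command `dp` not found on PATH")
    return subprocess.run(finetune_command(input_json, pretrained), check=True)
```

Setting a `trainable` flag to `false` would freeze that part of the pretrained model, which corresponds to training only the remaining parameters as described in Section 2.2.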
2023-07-22 17:08:43.230841: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:08:44.271205: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:08:44.271303: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:08:44.271321: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
DEEPMD INFO Change the model configurations according to the pretrained one...
DEEPMD INFO Change the 'descriptor' from {'trainable': True} to {'type': 'se_atten', 'sel': 60, 'rcut_smth': 0.5, 'rcut': 6.0, 'neuron': [25, 50, 100], 'resnet_dt': False, 'axis_neuron': 16, 'attn': 128, 'attn_layer': 2, 'attn_dotr': True, 'attn_mask': False, 'seed': 1801819940, 'activation_function': 'tanh', 'type_one_side': False, 'precision': 'default', 'trainable': True, 'exclude_types': []}.
DEEPMD INFO Change the 'fitting_net' from {'trainable': True} to {'neuron': [240, 240, 240], 'resnet_dt': True, 'seed': 2375417769, 'type': 'ener', 'numb_fparam': 0, 'numb_aparam': 0, 'activation_function': 'tanh', 'precision': 'default', 'trainable': True, 'rcond': 0.001, 'atom_ener': [], 'use_aparam_as_mask': False}.
/opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO training data with min nbor dist: 1.7120899465949608
DEEPMD INFO training data with max nbor size: [56]
DEEPMD INFO  _____               _____   __  __  _____           _     _  _
DEEPMD INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO installed to: /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install
DEEPMD INFO source : v2.2.0.b0-77-gc1299196
DEEPMD INFO source brach: devel
DEEPMD INFO source commit: c1299196
DEEPMD INFO source commit at: 2023-02-28 09:06:04 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build variant: cpu
DEEPMD INFO build with tf inc: /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: bohrium-13387-1030178
DEEPMD INFO computing device: gpu:0
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO Count of visible GPU: 1
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD INFO found 4 system(s):
DEEPMD INFO                     system  natoms  bch_sz  n_bch   prob  pbc
DEEPMD INFO iter.000000/02.fp/data.000     400       1    127  0.241    T
DEEPMD INFO iter.000000/02.fp/data.001     400       1    131  0.248    T
DEEPMD INFO iter.000000/02.fp/data.002     400       1    133  0.252    T
DEEPMD INFO iter.000000/02.fp/data.003     400       1    137  0.259    T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD INFO found 8 system(s):
DEEPMD INFO                     system  natoms  bch_sz  n_bch   prob  pbc
DEEPMD INFO iter.000001/02.fp/data.000     400       1    117  0.111    T
DEEPMD INFO iter.000001/02.fp/data.001     400       1    137  0.129    T
DEEPMD INFO iter.000001/02.fp/data.002     400       1    137  0.129    T
DEEPMD INFO iter.000001/02.fp/data.003     400       1    138  0.130    T
DEEPMD INFO iter.000002/02.fp/data.000     400       1    133  0.126    T
DEEPMD INFO iter.000002/02.fp/data.001     400       1    133  0.126    T
DEEPMD INFO iter.000002/02.fp/data.002     400       1    134  0.127    T
DEEPMD INFO iter.000002/02.fp/data.003     400       1    129  0.122    T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO training without frame parameter
DEEPMD INFO Changing energy bias in pretrained model for types ['Li', 'Ge', 'P', 'S']... (this step may take long time)
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
DEEPMD INFO Adjust batch size from 1024 to 2048
DEEPMD INFO Adjust batch size from 2048 to 4096
DEEPMD INFO Adjust batch size from 4096 to 8192
DEEPMD INFO RMSE of atomic energy after linear regression is: 0.0018085361367408837 eV/atom.
DEEPMD INFO Change energy bias of ['Li', 'Ge', 'P', 'S'] from [-4.17483491 -0.41748349 -0.83496698 -5.00980189] to [-4.1760068 -0.41760068 -0.83520136 -5.01120816].
DEEPMD INFO built lr
DEEPMD INFO built network
DEEPMD INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO initialize training from the frozen pretrained model
DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO batch 100 training time 8.34 s, testing time 0.05 s
DEEPMD INFO batch 200 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 300 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 500 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 700 training time 6.22 s, testing time 0.05 s
DEEPMD INFO batch 800 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 900 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 1000 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1200 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1300 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 1600 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1700 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1800 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 1900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 2000 training time 6.20 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 2100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 2200 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 2300 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO batch 2500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2600 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 2700 training time 6.21 s, testing time 0.05 s
DEEPMD INFO batch 2800 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 2900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO batch 3200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 3900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 4000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 4100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 4300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 4600 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 4700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 4800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 4900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 5000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 5300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 5600 training time 6.17 s, testing time 0.05 s
DEEPMD INFO batch 5700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 5800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 5900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 6100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6400 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 6700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 6900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7100 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7300 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7500 training time 6.18 s, testing time 0.05 s
DEEPMD INFO batch 7600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 7700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO batch 7900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 8000 training time 6.15 s, testing time 0.05 s
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 8100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 8200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO batch 8300 training time 6.16 
s, testing time 0.05 s DEEPMD INFO batch 8400 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8600 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8700 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 8800 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 8900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9000 training time 6.19 s, testing time 0.05 s DEEPMD INFO batch 9100 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9200 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9300 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 9400 training time 6.18 s, testing time 0.05 s DEEPMD INFO batch 9500 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9600 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9700 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 9800 training time 6.17 s, testing time 0.05 s DEEPMD INFO batch 9900 training time 6.16 s, testing time 0.05 s DEEPMD INFO batch 10000 training time 6.18 s, testing time 0.05 s DEEPMD INFO saved checkpoint model.ckpt DEEPMD INFO average training time: 0.0617 s/batch (exclude first 100 batches) DEEPMD INFO finished training DEEPMD INFO wall time: 634.868 s 2023-07-22 17:19:41.065653: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-07-22 17:19:42.102660: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:19:42.102767: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-07-22 17:19:42.102785: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. 
DEEPMD INFO The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam'] WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version. Instructions for updating: This API was designed for TensorFlow v1. 
See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2. DEEPMD INFO 1336 ops in the final graph.
- Parameter settings in `input.json`
- An extra command-line option: `--finetune pretrained.pb`
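The command-line difference between the two training modes can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the helper `build_train_command` and the `RUN_TRAINING` switch are hypothetical names, and the file names are placeholders. It relies on the `--finetune` option of `dp train`.

```python
import shutil
import subprocess


def build_train_command(input_json, pretrained=None):
    """Return the `dp train` argument list; append --finetune when a model is given."""
    cmd = ["dp", "train", input_json]
    if pretrained is not None:
        cmd += ["--finetune", pretrained]
    return cmd


# Placeholder file names for illustration only.
scratch_cmd = build_train_command("input.json")
finetune_cmd = build_train_command("input.json", pretrained="pretrained.pb")
print(" ".join(scratch_cmd))
print(" ".join(finetune_cmd))

# Hypothetical switch: training is only launched when explicitly enabled
# and the DeePMD-kit command-line tool is available on PATH.
RUN_TRAINING = False
if RUN_TRAINING and shutil.which("dp"):
    subprocess.run(finetune_cmd, check=True)
```

Building the command as a list keeps the two modes side by side: the only difference is the trailing `--finetune` pair, while the settings that differ inside `input.json` are handled separately.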
3.5 Model Validation
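We have now trained the DPA-1 potential for the solid-state electrolyte in two ways: from scratch and by fine-tuning. Let's check how the models perform, using `lcurve.out` (the learning-curve file, which records the training and validation errors together with the learning rate at each output step) and `dp test`!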
# step  rmse_val  rmse_trn  rmse_e_val  rmse_e_trn  rmse_f_val  rmse_f_trn  lr
     0  2.77e+01  2.71e+01  2.01e+00  2.03e+00  8.58e-01  8.36e-01  1.0e-03
   100  1.83e+01  1.62e+01  8.53e-02  5.69e-02  6.07e-01  5.38e-01  9.0e-04
   200  9.90e+00  9.35e+00  1.78e-01  1.78e-01  3.42e-01  3.23e-01  8.1e-04
   300  7.22e+00  6.22e+00  9.65e-02  1.05e-01  2.64e-01  2.26e-01  7.4e-04
   400  6.61e+00  5.67e+00  7.50e-03  1.31e-02  2.57e-01  2.20e-01  6.6e-04
   500  5.75e+00  5.45e+00  5.07e-02  4.98e-02  2.34e-01  2.21e-01  6.0e-04
   600  5.83e+00  4.90e+00  1.72e-03  3.21e-03  2.51e-01  2.11e-01  5.4e-04
   700  3.94e+00  4.28e+00  2.77e-02  3.05e-02  1.77e-01  1.93e-01  4.9e-04
   800  4.64e+00  4.28e+00  6.33e-02  6.20e-02  2.16e-01  1.99e-01  4.4e-04
   900  4.17e+00  4.59e+00  2.14e-02  1.86e-02  2.09e-01  2.30e-01  4.0e-04
  1000  4.24e+00  3.72e+00  2.79e-02  2.48e-02  2.22e-01  1.95e-01  3.6e-04
  1100  3.41e+00  3.72e+00  4.79e-02  4.03e-02  1.84e-01  2.03e-01  3.2e-04
  1200  3.17e+00  3.45e+00  1.29e-02  1.99e-02  1.85e-01  2.00e-01  2.9e-04
  1300  3.46e+00  3.29e+00  8.73e-02  7.90e-02  1.92e-01  1.84e-01  2.6e-04
  1400  3.06e+00  3.19e+00  5.29e-02  4.67e-02  1.89e-01  2.00e-01  2.4e-04
  1500  2.89e+00  2.60e+00  7.33e-03  9.06e-03  1.96e-01  1.77e-01  2.1e-04
  1600  2.75e+00  2.56e+00  5.06e-03  1.38e-04  1.97e-01  1.84e-01  1.9e-04
  1700  2.44e+00  2.52e+00  1.66e-02  1.52e-02  1.82e-01  1.89e-01  1.7e-04
  1800  1.87e+00  2.47e+00  2.41e-02  7.82e-03  1.44e-01  1.96e-01  1.6e-04
  1900  2.24e+00  2.21e+00  1.16e-02  1.34e-02  1.86e-01  1.84e-01  1.4e-04
  2000  2.03e+00  1.88e+00  1.75e-02  2.10e-02  1.76e-01  1.62e-01  1.3e-04
  2100  1.95e+00  1.91e+00  1.21e-02  1.50e-02  1.79e-01  1.75e-01  1.2e-04
  2200  1.89e+00  1.99e+00  1.06e-02  1.47e-02  1.83e-01  1.92e-01  1.0e-04
  2300  1.81e+00  1.78e+00  9.27e-03  6.88e-03  1.85e-01  1.82e-01  9.4e-05
  2400  1.63e+00  1.65e+00  1.87e-02  1.64e-02  1.72e-01  1.75e-01  8.5e-05
  2500  1.76e+00  1.67e+00  1.97e-02  2.27e-02  1.95e-01  1.82e-01  7.7e-05
  2600  1.50e+00  1.51e+00  9.35e-03  6.31e-03  1.78e-01  1.79e-01  6.9e-05
  2700  1.47e+00  1.22e+00  1.93e-03  4.47e-03  1.84e-01  1.53e-01  6.3e-05
  2800  1.44e+00  1.44e+00  1.51e-02  1.35e-02  1.86e-01  1.87e-01  5.7e-05
  2900  1.29e+00  1.42e+00  1.09e-02  8.51e-03  1.77e-01  1.96e-01  5.1e-05
  3000  1.30e+00  1.28e+00  8.11e-03  7.30e-03  1.88e-01  1.85e-01  4.6e-05
  3100  1.07e+00  1.21e+00  3.57e-03  6.98e-03  1.64e-01  1.84e-01  4.2e-05
  3200  1.12e+00  1.07e+00  4.16e-03  2.35e-03  1.80e-01  1.73e-01  3.8e-05
  3300  1.22e+00  1.05e+00  2.06e-03  3.05e-03  2.07e-01  1.78e-01  3.4e-05
  3400  1.04e+00  1.00e+00  1.11e-02  5.72e-03  1.82e-01  1.77e-01  3.1e-05
  3500  8.83e-01  9.68e-01  4.07e-03  4.30e-03  1.65e-01  1.80e-01  2.8e-05
  3600  9.95e-01  9.55e-01  1.16e-02  7.57e-03  1.90e-01  1.85e-01  2.5e-05
  3700  8.91e-01  8.59e-01  9.37e-03  1.26e-02  1.80e-01  1.70e-01  2.2e-05
  3800  9.13e-01  8.02e-01  9.55e-04  3.18e-03  1.98e-01  1.73e-01  2.0e-05
  3900  8.30e-01  7.84e-01  8.00e-03  4.25e-03  1.85e-01  1.77e-01  1.8e-05
  4000  7.63e-01  8.00e-01  1.97e-03  1.83e-03  1.82e-01  1.91e-01  1.7e-05
  4100  7.43e-01  7.65e-01  8.91e-03  1.14e-02  1.81e-01  1.83e-01  1.5e-05
  4200  7.29e-01  6.09e-01  1.04e-02  7.09e-03  1.84e-01  1.56e-01  1.3e-05
  4300  7.21e-01  7.04e-01  1.78e-03  6.26e-03  1.99e-01  1.91e-01  1.2e-05
  4400  6.60e-01  6.26e-01  1.30e-03  5.52e-03  1.91e-01  1.78e-01  1.1e-05
  4500  6.25e-01  5.83e-01  6.60e-03  3.96e-03  1.85e-01  1.75e-01  9.9e-06
  4600  5.86e-01  5.10e-01  3.30e-04  5.71e-03  1.86e-01  1.58e-01  8.9e-06
  4700  5.21e-01  5.50e-01  6.89e-03  5.96e-03  1.67e-01  1.78e-01  8.1e-06
  4800  5.68e-01  5.41e-01  3.52e-03  1.32e-03  1.96e-01  1.88e-01  7.3e-06
  4900  4.90e-01  5.28e-01  4.22e-03  6.94e-03  1.76e-01  1.86e-01  6.6e-06
  5000  4.88e-01  4.61e-01  2.72e-03  1.98e-03  1.84e-01  1.75e-01  5.9e-06
  5100  4.75e-01  4.26e-01  6.41e-04  2.15e-03  1.88e-01  1.68e-01  5.3e-06
  5200  4.48e-01  4.38e-01  5.80e-03  1.03e-02  1.80e-01  1.61e-01  4.8e-06
  5300  4.53e-01  4.28e-01  6.71e-03  1.81e-03  1.87e-01  1.85e-01  4.4e-06
  5400  4.24e-01  4.13e-01  2.37e-03  8.79e-04  1.90e-01  1.86e-01  3.9e-06
  5500  4.34e-01  4.11e-01  4.81e-04  3.49e-03  2.04e-01  1.90e-01  3.5e-06
  5600  3.85e-01  3.41e-01  2.10e-04  4.40e-03  1.88e-01  1.61e-01  3.2e-06
  5700  3.46e-01  3.53e-01  3.73e-03  3.43e-03  1.72e-01  1.76e-01  2.9e-06
  5800  3.55e-01  3.47e-01  2.80e-03  5.82e-04  1.85e-01  1.83e-01  2.6e-06
  5900  3.28e-01  3.29e-01  1.18e-03  1.85e-03  1.79e-01  1.79e-01  2.4e-06
  6000  3.09e-01  3.30e-01  3.07e-03  1.18e-03  1.71e-01  1.86e-01  2.1e-06
  6100  3.23e-01  3.17e-01  6.21e-03  2.25e-03  1.75e-01  1.84e-01  1.9e-06
  6200  3.14e-01  3.30e-01  3.57e-04  3.39e-03  1.90e-01  1.96e-01  1.7e-06
  6300  3.07e-01  3.15e-01  9.20e-04  3.24e-03  1.92e-01  1.93e-01  1.6e-06
  6400  2.97e-01  2.99e-01  1.34e-03  5.33e-03  1.91e-01  1.80e-01  1.4e-06
  6500  2.72e-01  2.96e-01  2.25e-03  6.45e-03  1.78e-01  1.77e-01  1.3e-06
  6600  2.73e-01  2.93e-01  1.84e-03  2.25e-03  1.85e-01  1.98e-01  1.1e-06
  6700  2.86e-01  2.77e-01  4.48e-03  5.85e-03  1.90e-01  1.76e-01  1.0e-06
  6800  2.56e-01  2.75e-01  1.36e-03  2.78e-03  1.83e-01  1.93e-01  9.3e-07
  6900  2.60e-01  2.64e-01  2.62e-03  2.45e-03  1.88e-01  1.91e-01  8.4e-07
  7000  2.40e-01  2.35e-01  3.99e-03  4.34e-03  1.71e-01  1.64e-01  7.6e-07
  7100  2.40e-01  2.54e-01  1.74e-03  3.09e-03  1.83e-01  1.90e-01  6.9e-07
  7200  2.18e-01  2.19e-01  1.32e-03  1.63e-03  1.70e-01  1.70e-01  6.2e-07
  7300  2.31e-01  2.30e-01  3.02e-03  2.25e-03  1.79e-01  1.80e-01  5.6e-07
  7400  2.31e-01  2.41e-01  3.03e-03  1.80e-03  1.82e-01  1.94e-01  5.1e-07
  7500  2.24e-01  2.41e-01  1.50e-04  2.81e-03  1.86e-01  1.94e-01  4.6e-07
  7600  2.29e-01  2.35e-01  1.44e-03  1.79e-03  1.91e-01  1.96e-01  4.1e-07
  7700  2.11e-01  1.94e-01  6.94e-04  2.66e-03  1.80e-01  1.59e-01  3.7e-07
  7800  2.09e-01  2.20e-01  2.60e-03  1.23e-03  1.75e-01  1.89e-01  3.4e-07
  7900  2.20e-01  1.96e-01  3.74e-03  3.86e-03  1.82e-01  1.58e-01  3.0e-07
  8000  2.14e-01  2.81e-01  3.20e-03  1.18e-02  1.81e-01  1.35e-01  2.7e-07
  8100  2.14e-01  2.14e-01  3.17e-03  8.66e-05  1.83e-01  1.92e-01  2.5e-07
  8200  2.46e-01  2.11e-01  7.13e-03  4.51e-04  1.81e-01  1.91e-01  2.2e-07
  8300  2.13e-01  2.05e-01  2.08e-03  2.95e-03  1.90e-01  1.79e-01  2.0e-07
  8400  2.06e-01  1.96e-01  3.87e-03  2.31e-04  1.76e-01  1.80e-01  1.8e-07
  8500  1.93e-01  2.19e-01  8.70e-04  2.88e-03  1.78e-01  1.96e-01  1.6e-07
  8600  2.18e-01  2.14e-01  2.43e-03  8.26e-04  1.98e-01  1.99e-01  1.5e-07
  8700  2.12e-01  2.06e-01  2.30e-03  3.80e-03  1.95e-01  1.80e-01  1.3e-07
  8800  2.31e-01  2.02e-01  6.19e-03  3.20e-03  1.84e-01  1.81e-01  1.2e-07
  8900  1.93e-01  1.79e-01  2.89e-03  4.03e-04  1.75e-01  1.70e-01  1.1e-07
  9000  1.82e-01  1.93e-01  9.33e-04  1.00e-03  1.73e-01  1.83e-01  9.8e-08
  9100  1.83e-01  2.22e-01  1.07e-03  2.42e-03  1.74e-01  2.07e-01  8.8e-08
  9200  1.93e-01  1.95e-01  3.45e-03  2.74e-03  1.73e-01  1.80e-01  8.0e-08
  9300  2.07e-01  1.89e-01  4.09e-03  2.20e-05  1.83e-01  1.83e-01  7.2e-08
  9400  2.10e-01  2.30e-01  1.47e-04  3.88e-03  2.03e-01  2.10e-01  6.5e-08
  9500  2.28e-01  1.81e-01  6.78e-03  1.27e-03  1.78e-01  1.74e-01  5.9e-08
  9600  2.00e-01  1.93e-01  1.75e-03  4.21e-03  1.92e-01  1.69e-01  5.3e-08
  9700  1.90e-01  1.69e-01  2.14e-03  2.24e-03  1.81e-01  1.59e-01  4.8e-08
  9800  1.85e-01  2.30e-01  1.29e-04  5.23e-03  1.81e-01  2.01e-01  4.3e-08
  9900  1.88e-01  1.97e-01  3.34e-03  2.92e-03  1.73e-01  1.85e-01  3.9e-08
 10000  1.79e-01  2.07e-01  1.66e-03  3.54e-03  1.73e-01  1.91e-01  3.5e-08
['dpa/lcurve.out', 'dpa_finetune/lcurve.out']
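To compare the two learning curves numerically rather than by eye, the `lcurve.out` files listed above can be parsed with a few lines of standard-library Python. The sketch below assumes the layout shown in this section (one `#` header line naming the columns, then whitespace-separated numeric rows); `read_lcurve` and `tail_mean` are hypothetical helper names, and the embedded sample holds the last two rows of the curve printed above.

```python
import os


def read_lcurve(lines):
    """Parse lcurve.out-style text into a list of {column: value} dicts."""
    header, rows = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first comment line carries the column names.
            if header is None:
                header = line.lstrip("#").split()
            continue
        if header is None:
            raise ValueError("no '#' header line found before the data rows")
        rows.append(dict(zip(header, (float(x) for x in line.split()))))
    return rows


def tail_mean(rows, column, n=10):
    """Average a column over the last n recorded steps to smooth batch noise."""
    tail = rows[-n:]
    return sum(r[column] for r in tail) / len(tail)


sample = """\
# step rmse_val rmse_trn rmse_e_val rmse_e_trn rmse_f_val rmse_f_trn lr
9900 1.88e-01 1.97e-01 3.34e-03 2.92e-03 1.73e-01 1.85e-01 3.9e-08
10000 1.79e-01 2.07e-01 1.66e-03 3.54e-03 1.73e-01 1.91e-01 3.5e-08
""".splitlines()
sample_rows = read_lcurve(sample)
print("sample rmse_f_val:", tail_mean(sample_rows, "rmse_f_val", n=2))

# Compare the real files when they are present in the working directory.
for path in ["dpa/lcurve.out", "dpa_finetune/lcurve.out"]:
    if os.path.exists(path):
        with open(path) as fh:
            rows = read_lcurve(fh)
        print(path, "rmse_f_val (last 10 outputs):", tail_mean(rows, "rmse_f_val"))
```

Averaging over the last few output steps is a deliberate choice: each row is computed on a single batch, so individual values fluctuate from step to step.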


... (TensorFlow startup and deprecation warnings omitted)
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.000
DEEPMD INFO Adjust batch size from 1024 to 2048
DEEPMD INFO Adjust batch size from 2048 to 4096
DEEPMD INFO Adjust batch size from 4096 to 8192
DEEPMD INFO Adjust batch size from 8192 to 16384
DEEPMD INFO Adjust batch size from 16384 to 32768
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 9.427785e-01 eV
DEEPMD INFO Energy RMSE : 1.196250e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.356946e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.990626e-03 eV
DEEPMD INFO Force MAE : 1.336536e-01 eV/A
DEEPMD INFO Force RMSE : 1.815210e-01 eV/A
DEEPMD INFO Virial MAE : 9.645088e+00 eV
DEEPMD INFO Virial RMSE : 1.431580e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.411272e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 3.578950e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.001
... (TensorFlow GPU out-of-memory warnings omitted: RESOURCE_EXHAUSTED while allocating 1.85GiB)
DEEPMD INFO Adjust batch size from 32768 to 16384
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 8.434060e-01 eV
DEEPMD INFO Energy RMSE : 1.056923e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.108515e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.642309e-03 eV
DEEPMD INFO Force MAE : 1.342740e-01 eV/A
DEEPMD INFO Force RMSE : 1.823510e-01 eV/A
DEEPMD INFO Virial MAE : 9.380907e+00 eV
DEEPMD INFO Virial RMSE : 1.377223e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.345227e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 3.443056e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.002
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 1.096916e+00 eV
DEEPMD INFO Energy RMSE : 1.411227e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.742291e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 3.528067e-03 eV
DEEPMD INFO Force MAE : 1.328904e-01 eV/A
DEEPMD INFO Force RMSE : 1.803824e-01 eV/A
DEEPMD INFO Virial MAE : 8.762420e+00 eV
DEEPMD INFO Virial RMSE : 1.282206e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.190605e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 3.205516e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.003
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 1.005292e+00 eV
DEEPMD INFO Energy RMSE : 1.240209e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.513231e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 3.100522e-03 eV
DEEPMD INFO Force MAE : 1.341954e-01 eV/A
DEEPMD INFO Force RMSE : 1.820485e-01 eV/A
DEEPMD INFO Virial MAE : 8.224030e+00 eV
DEEPMD INFO Virial RMSE : 1.197917e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.056007e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 2.994793e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ----------weighted average of errors-----------
DEEPMD INFO # number of systems : 4
DEEPMD INFO Energy MAE : 9.720983e-01 eV
DEEPMD INFO Energy RMSE : 1.232658e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.430246e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 3.081644e-03 eV
DEEPMD INFO Force MAE : 1.337533e-01 eV/A
DEEPMD INFO Force RMSE : 1.815773e-01 eV/A
DEEPMD INFO Virial MAE : 9.003111e+00 eV
DEEPMD INFO Virial RMSE : 1.325257e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.250778e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 3.313142e-02 eV
DEEPMD INFO # -----------------------------------------------
... (TensorFlow startup and deprecation warnings omitted)
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.000
DEEPMD INFO Adjust batch size from 1024 to 2048
DEEPMD INFO Adjust batch size from 2048 to 4096
DEEPMD INFO Adjust batch size from 4096 to 8192
DEEPMD INFO Adjust batch size from 8192 to 16384
DEEPMD INFO Adjust batch size from 16384 to 32768
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 8.621734e-01 eV
DEEPMD INFO Energy RMSE : 1.057169e+00 eV
DEEPMD INFO Energy MAE/Natoms : 2.155433e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.642923e-03 eV
DEEPMD INFO Force MAE : 1.101241e-01 eV/A
DEEPMD INFO Force RMSE : 1.489321e-01 eV/A
DEEPMD INFO Virial MAE : 1.266581e+01 eV
DEEPMD INFO Virial RMSE : 1.956447e+01 eV
DEEPMD INFO Virial MAE/Natoms : 3.166453e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 4.891117e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.001
... (TensorFlow GPU out-of-memory warnings omitted: RESOURCE_EXHAUSTED while allocating 1.85GiB)
DEEPMD INFO Adjust batch size from 32768 to 16384
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 6.472647e-01 eV
DEEPMD INFO Energy RMSE : 8.226297e-01 eV
DEEPMD INFO Energy MAE/Natoms : 1.618162e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.056574e-03 eV
DEEPMD INFO Force MAE : 1.106034e-01 eV/A
DEEPMD INFO Force RMSE : 1.492427e-01 eV/A
DEEPMD INFO Virial MAE : 1.168436e+01 eV
DEEPMD INFO Virial RMSE : 1.808673e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.921089e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 4.521683e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.002
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 6.905101e-01 eV
DEEPMD INFO Energy RMSE : 8.749119e-01 eV
DEEPMD INFO Energy MAE/Natoms : 1.726275e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.187280e-03 eV
DEEPMD INFO Force MAE : 1.094928e-01 eV/A
DEEPMD INFO Force RMSE : 1.477969e-01 eV/A
DEEPMD INFO Virial MAE : 1.099581e+01 eV
DEEPMD INFO Virial RMSE : 1.688399e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.748951e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 4.220997e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ---------------output of dp test---------------
DEEPMD INFO # testing system : iter.000001/02.fp/data.003
DEEPMD INFO # number of test data : 100
DEEPMD INFO Energy MAE : 6.999822e-01 eV
DEEPMD INFO Energy RMSE : 8.619741e-01 eV
DEEPMD INFO Energy MAE/Natoms : 1.749956e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.154935e-03 eV
DEEPMD INFO Force MAE : 1.101929e-01 eV/A
DEEPMD INFO Force RMSE : 1.488093e-01 eV/A
DEEPMD INFO Virial MAE : 1.067116e+01 eV
DEEPMD INFO Virial RMSE : 1.624072e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.667789e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 4.060179e-02 eV
DEEPMD INFO # -----------------------------------------------
DEEPMD INFO # ----------weighted average of errors-----------
DEEPMD INFO # number of systems : 4
DEEPMD INFO Energy MAE : 7.249826e-01 eV
DEEPMD INFO Energy RMSE : 9.086799e-01 eV
DEEPMD INFO Energy MAE/Natoms : 1.812456e-03 eV
DEEPMD INFO Energy RMSE/Natoms : 2.271700e-03 eV
DEEPMD INFO Force MAE : 1.101033e-01 eV/A
DEEPMD INFO Force RMSE : 1.486962e-01 eV/A
DEEPMD INFO Virial MAE : 1.150428e+01 eV
DEEPMD INFO Virial RMSE : 1.773928e+01 eV
DEEPMD INFO Virial MAE/Natoms : 2.876071e-02 eV
DEEPMD INFO Virial RMSE/Natoms : 4.434820e-02 eV
DEEPMD INFO # -----------------------------------------------
Processing file: result_dpa.e_peratom.out, output written to ./dp_test_dpa.png
Processing file: result_finetune.e_peratom.out, output written to ./dp_test_finetune.png
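The per-atom energy files named above can also be checked numerically. The sketch below assumes each `*.e_peratom.out` file holds two whitespace-separated columns per frame, reference energy per atom followed by predicted energy per atom, with `#` comment lines; that layout is an assumption about the `dp test` detail files, so verify it against the header of your own output. `load_pairs` and `mae_rmse` are hypothetical helper names, and the `demo` values are made up purely to exercise the code.

```python
import math
import os


def load_pairs(lines):
    """Read (reference, prediction) pairs, skipping blank and '#' comment lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ref, pred = (float(x) for x in line.split()[:2])
        pairs.append((ref, pred))
    return pairs


def mae_rmse(pairs):
    """Return the mean absolute error and root-mean-square error of the pairs."""
    errors = [pred - ref for ref, pred in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Made-up numbers, only to show the call pattern.
demo = ["# data_e pred_e", "1.0 1.1", "2.0 1.9"]
demo_mae, demo_rmse = mae_rmse(load_pairs(demo))
print(f"demo: MAE {demo_mae:.3f}, RMSE {demo_rmse:.3f}")

# Evaluate the real detail files when they exist in the working directory.
for name in ("result_dpa.e_peratom.out", "result_finetune.e_peratom.out"):
    if os.path.exists(name):
        with open(name) as fh:
            mae, rmse = mae_rmse(load_pairs(fh))
        print(f"{name}: MAE/Natoms {mae:.3e} eV, RMSE/Natoms {rmse:.3e} eV")
```

Plotting the same pairs as predicted versus reference energy gives the usual parity plot: points close to the diagonal indicate a well-fitted model.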


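Comparing Energy RMSE/Natoms and Force RMSE, the model fine-tuned from the pretrained model is more accurate than the model trained from scratch under the same conditions: in the weighted averages reported by `dp test`, Energy RMSE/Natoms drops from 3.08e-03 eV to 2.27e-03 eV and Force RMSE from 0.182 eV/Å to 0.149 eV/Å.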
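The size of the improvement can be made explicit with a small calculation on the weighted-average errors printed by `dp test`. This sketch assumes that the first summary block belongs to the from-scratch model and the second to the fine-tuned one, the order suggested by the result file names; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Weighted-average values copied from the two `dp test` summaries.
# Assumption: first summary = from-scratch model, second = fine-tuned model.
scratch = {"energy_rmse_per_atom": 3.081644e-03, "force_rmse": 1.815773e-01}
finetune = {"energy_rmse_per_atom": 2.271700e-03, "force_rmse": 1.486962e-01}

reductions = {key: relative_reduction(scratch[key], finetune[key]) for key in scratch}
for key, value in reductions.items():
    print(f"{key}: {100 * value:.1f}% lower after fine-tuning")
```

Under that assumption the per-atom energy RMSE falls by roughly 26% and the force RMSE by roughly 18%, for the same 10000 training steps.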
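ddd! Further reading:
- Deep Potential Molecular Dynamics Guide | Solid-State Electrolytes in Practice, Property Prediction: "Hands-On Solid-State Electrolyte Research with Deep Potential Molecular Dynamics"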







