ddd! Meet DPA-1: A Guide | Solid-State Electrolytes in Practice: Model Training

Tags: AI4S, DeePMD, DPA-1
Author: ZhexuanS
Published: 2023-07-23
Recommended image: DeePMD-kit:2.2.1-cuda11.6-notebook
Recommended machine type: c12_m92_1 * NVIDIA V100
Contents

1. Learning Objectives
2. A Brief Introduction to DPA-1
2.1 Background
2.2 Method
2.3 Experimental Validation
2.4 Model Interpretability
2.5 Outlook
3. Solid-State Electrolytes in Practice: Training a DPA-1 Potential
3.1 Downloading the Dataset
3.2 Preparing the Input Script (Training from Scratch)
3.3 Model Training (from Scratch)
3.4 Fine-Tuning a Pretrained Model
3.5 Model Validation
4. References

ddd! Meet DPA-1: A Guide | Solid-State Electrolytes in Practice: Model Training


©️ Copyright 2023 @ Authors
Author: Zhexuan Song 📨
Date: 2023-07-20
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the blue "Connect" button at the top of the page, choose the `deepmd-kit:2.2.1-cuda11.6` image and the `c12_m92_1 * NVIDIA V100` machine type, and the notebook will be ready to run in a moment.
*Prerequisite: basic familiarity with DeePMD-kit. If you are new to it, start with "Quick Start: DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane".


🎯 Welcome to "ddd! Meet DPA-1: A Guide | Solid-State Electrolytes in Practice: Model Training"

Drawing on the DPA-1 paper (arXiv:2208.08236), this guide introduces the background and basic principles of the DPA-1 model, with practical code examples to help you understand the key parameters. Using solid-state electrolytes as the example, it then walks you step by step through training a DPA-1 potential on the training set from J. Chem. Phys. 154, 094703 (2021).

Come meet DPA-1 with us and open a new chapter in your exploration of interatomic potentials!


📐 One day, your advisor asks you to study a solid-state electrolyte with molecular dynamics simulations.

Student A: Rule out classical molecular dynamics (CMD), whose accuracy falls short 🙅‍♀️

Student B: Rule out ab initio molecular dynamics (AIMD), which is highly accurate but cannot handle large systems or long time scales 🙅‍♀️

You: Naturally, we should use machine learning molecular dynamics (MLMD), the hottest approach right now and the one that combines efficiency with accuracy 🙌

👍 Advisor: Agreed. Go do it!

After a night spent studying machine-learning potentials, you start to wonder: machine-learning potentials are practical, but is there a model I can use right out of the box? And if not, could I start from some publicly available model and, with only light adjustment on my own dataset, obtain a reliable one?

You find that existing potential models and methods have shortcomings:

  • Some general-purpose models apply only to narrow scenarios and cover a small region of chemical space
  • Tools such as dpgen can generate configurationally rich datasets for retraining a model, but at considerable cost

At this point you fully appreciate: a large-scale pretrained model/potential would be ideal, since it could deliver a potential suited to your application while saving both time and money 🚀.


📣 Recently, researchers Duo Zhang, Hangrui Bi, and collaborators at DP Technology and the AI for Science Institute, Beijing posted a preprint on arXiv titled "DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation".

With a better encoding of element types and a key attention mechanism, DPA-1 greatly increases the capacity and transferability of previous Deep Potential models, yielding a large pretrained model that covers most common elements of the periodic table. Transfer-learning results on a variety of datasets show that the model can dramatically reduce the data needed in new scenarios. See the WeChat announcement and the original paper for details. Both training and molecular dynamics with DPA-1 are open-sourced in the DeePMD-kit project of the DeepModeling community. The work was carried out on Bohrium, DP Technology's scientific computing platform.


By now you know that DPA-1 is an attention-based DP model that describes interatomic interactions effectively; after pretraining, it can significantly reduce the extra work needed for downstream tasks.

All you need is a guide that quickly teaches you how to train a DPA-1 potential from scratch (dpa), and how to fine-tune an existing large model on your own dataset (dpa_finetune).


1. Learning Objectives

After working through this tutorial, you will have learned:

  1. The basic principles and applications of DPA-1;
  2. Hands-on DPA-1 potential training on a solid-state electrolyte: reading the input script; training from scratch vs. fine-tuning an existing pretrained model; and model evaluation.

2. A Brief Introduction to DPA-1

👂 Eager to get hands-on? Jump straight to Section 3~


2.1 Background

Potential-energy training has always pursued a balance between accuracy and efficiency. Classical force fields are fast and convenient, but their accuracy is hard to push further; AIMD (ab initio molecular dynamics) offers far higher accuracy, but its computational cost rules out large systems and long time scales. With the rise of AI for Science, machine learning has made it possible to train potentials that are both accurate and efficient (Figure 1: comparison of molecular dynamics approaches). In the MLMD paradigm, quantum-mechanical (QM) calculations are no longer run directly inside AIMD; instead they serve to prepare the dataset for a machine-learning potential (MLP). AIMD results can, of course, also serve as an initial dataset.

[Figure 1: comparison of molecular dynamics approaches]

ref. Machine learning-accelerated quantum mechanics-based atomistic simulations for industrial applications

However, because existing models transfer poorly and no general-purpose large model is available, scientists facing a new, complex system still mostly have to generate large amounts of computed data and train from scratch to obtain a usable, reasonably complete potential. As electronic-structure data accumulate, and by analogy with other AI fields such as computer vision (CV) and natural language processing (NLP), "pretraining + fine-tuning on a small amount of data" is the natural way to tackle this problem.

To realize this paradigm, we urgently need a model architecture with strong transferability that can accommodate most elements of the periodic table.


2.2 Method

DPA-1 is a comprehensive upgrade of the DP family of models. Using a key gated attention mechanism, it models interatomic interactions much more thoroughly: trained on the same data, it can learn more of the hidden atomic-interaction information, which greatly improves transferability across datasets with different conformations and compositions and, in turn, improves sampling efficiency during data generation. By encoding element-type information, the model also expands its capacity for chemical elements. The developers pretrained the model on a large dataset containing 56 elements and then transferred it to a variety of downstream tasks. Experiments show that the pretrained model can sharply reduce the data and training cost required downstream while improving prediction accuracy, with far-reaching implications for molecular simulation (Figure 2: schematic of the DPA-1 architecture).

[Figure 2: schematic of the DPA-1 architecture]


Compared with earlier DP models, DPA-1 modifies the methodology as follows:

  • Descriptor: [type embedding] the atom type is added as an input to the embedding matrix; an [attention mechanism] is introduced that reweights interatomic interactions according to interatomic distances and angles (see the sketch below)
  • Loss computation: to fine-tune the pretrained model on a new dataset, the energy bias of the pretrained model is first reset using statistics of the new data; part of the pretrained parameters are then frozen while the rest are trained

On the inference side, DPA-1 retains the high efficiency of the DP family and can run molecular dynamics simulations of systems with large numbers of atoms and element types.
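
To make the attention step concrete, here is a minimal NumPy sketch of one gated attention layer acting on the neighbors of a single atom. It illustrates the idea only; it is not DeePMD-kit's actual implementation, and all names and sizes below are made up for the example.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_layer(G, r_hat, Wq, Wk, Wv):
    """One self-attention update over a single atom's neighborhood.

    G     : (Nc, d) neighbor embeddings (d plays the role of the `attn` size)
    r_hat : (Nc, 3) unit vectors from the center atom to each neighbor
    Wq, Wk, Wv : (d, d) learned projections (random here)
    """
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    A = softmax(Q @ K.T / np.sqrt(G.shape[-1]))  # standard dot-product attention
    A = A * (r_hat @ r_hat.T)                    # the `attn_dotr` gate: reweight by angles
    return G + A @ V                             # residual update of the embeddings

rng = np.random.default_rng(0)
Nc, d = 5, 8                                     # 5 neighbors, toy embedding size 8
G = rng.normal(size=(Nc, d))
r = rng.normal(size=(Nc, 3))
r_hat = r / np.linalg.norm(r, axis=1, keepdims=True)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(gated_attention_layer(G, r_hat, Wq, Wk, Wv).shape)  # (5, 8)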


2.3 Experimental Validation

Transferability tests

  • Ternary alloy (AlMgCu) dataset
  • Solid-state electrolyte (SSE) dataset
  • High-entropy alloy (HEA) dataset

Note: the OC20 dataset consists of single adsorbates (small molecules) physically bound to catalyst surfaces; the surfaces are periodic bulk materials covering 56 elements.

To test the gain in transferability brought by the DPA-1 architecture, the researchers deliberately split each training set into subsets with very different compositions and configurations (for AlMgCu, the "single" subset contains only elemental data; "binary" contains only binary data, i.e. Al-Mg, Al-Cu, Mg-Cu; and "ternary" holds the remaining ternary data). They trained on some subsets and tested on the others to probe the model's transferability under extreme conditions. (Figure 3: learning curves of energy and force errors for DeepPot-SE and DPA-1 across settings and systems.)

[Figure 3: learning curves for DeepPot-SE and DPA-1]

As the figure shows, compared with DeepPot-SE, DPA-1's test accuracy improves by as much as one to two orders of magnitude under some conditions. This indicates that the model can learn hidden interatomic-interaction information from the available data, and further demonstrates its strong transferability.

Sample-efficiency test: a case study (Figure 4: sample efficiency of the models). Even with only a small amount of ternary data, DPA-1 reaches high accuracy, saving roughly 90% of the ternary data compared with DeepPot-SE.

[Figure 4: sample efficiency of the models]


2.4 Model Interpretability

To probe the interpretability of this pretrained model covering most of the periodic table, the researchers reduced the learned element embeddings with PCA and visualized them, as shown in Figure 5:

[Figure 5: PCA visualization of the learned element embeddings]

In the latent space, all elements fall on a spiral: elements of the same period descend along the spiral, while elements of the same group line up perpendicular to it, neatly mirroring their positions in the periodic table and nicely demonstrating the model's interpretability.
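
If you ever want to reproduce this kind of plot, the PCA step looks roughly like the sketch below. The embedding matrix here is random stand-in data; extracting the real type-embedding weights from a trained model is out of scope for this tutorial.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

rng = np.random.default_rng(0)
type_embed = rng.normal(size=(56, 8))             # stand-in for 56 learned element embeddings

X = type_embed - type_embed.mean(axis=0)          # center the data
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # PCA via SVD
pcs = X @ Vt[:3].T                                # project onto the first 3 principal components

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2])
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()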


2.5 Outlook

DPA-1 opens a new paradigm for producing machine-learning potentials, proving the feasibility of the "pretraining + light task-specific fine-tuning" workflow. Going forward, the researchers will keep working on automated production and testing of potentials, and will also explore multi-task training, unsupervised learning, model compression, and distillation, so that users can generate the potential a downstream task needs with one click. Larger and more complete databases, and the integration of downstream tasks with the dflow workflow framework, are also promising directions.


3. Solid-State Electrolytes in Practice: Training a DPA-1 Potential

Theory done, let's get hands-on! In this section we use the solid-state electrolyte dataset LiGePS-SSE-PBE to train DPA-1 both from scratch and by fine-tuning.

Note: the dataset used in this tutorial comes from AIS-Square. If you need more models or data, go explore it~


3.1 Downloading the Dataset

[4]
# Download the tutorial files
! git clone https://gitee.com/zhexuan-song/study_examples.git
! cp -r ./study_examples/dpa_dataset/* .
fatal: destination path 'study_examples' already exists and is not an empty directory.

Let's look at what the tutorial files contain:

  • Datasets: iter.00000[0-2]

    Level of theory: PBE; systems: 💡 generated by DP-GEN iterations

  • Input script: input.json

[1]
! tree -L 3
.
├── DeePMD-SSE.ipynb
├── dpa
│   ├── checkpoint
│   ├── dpa.pb
│   ├── input.json
│   ├── iter.000000
│   │   └── 02.fp
│   ├── iter.000001
│   │   └── 02.fp
│   └── iter.000002
│       └── 02.fp
├── dpa-sse.ipynb
└── dpa_finetune
    ├── input.json
    ├── iter.000000
    │   └── 02.fp
    ├── iter.000001
    │   └── 02.fp
    └── iter.000002
        └── 02.fp

14 directories, 6 files
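
As a quick sanity check, you can load one of these systems with dpdata (the data.* directories are in DeePMD npy format). This sketch assumes the dpdata package is available in the recommended image:

import dpdata

# Load a single labeled system and inspect its contents
system = dpdata.LabeledSystem("./dpa/iter.000000/02.fp/data.000", fmt="deepmd/npy")
print(system)                          # composition and number of frames
print(system.data["energies"].shape)   # one energy per frame
print(system.data["forces"].shape)     # (n_frames, n_atoms, 3)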

3.2 Preparing the Input Script (Training from Scratch)

[2]
# View the input file
! cat ./dpa/input.json
{
    "model": {
        "descriptor": {
            "type": "se_atten",
            "sel": 60,
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 16,
            "attn": 128,
            "attn_layer": 1,
            "attn_dotr": true,
            "attn_mask": false,
            "seed": 1801819940,
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "_coord_norm": true,
            "_type_fitting_net": false,
            "seed": 2375417769,
            "_activation_function": "tanh"
        },
        "type_map": [
            "Li",
            "Ge",
            "P",
            "S"
        ]
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 50,
        "stop_lr": 3.51e-08
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0,
        "limit_pref_v": 0
    },
    "training": {
        "training_data": {
            "systems": [
		"iter.000000/02.fp/data.000",
                "iter.000000/02.fp/data.001",
                "iter.000000/02.fp/data.002",
                "iter.000000/02.fp/data.003"
            ],
            "batch_size": 1
        },
        "validation_data": {
            "systems": [
                "iter.000001/02.fp/data.000",
                "iter.000001/02.fp/data.001",
                "iter.000001/02.fp/data.002",
                "iter.000001/02.fp/data.003",
                "iter.000002/02.fp/data.000",
                "iter.000002/02.fp/data.001",
                "iter.000002/02.fp/data.002",
                "iter.000002/02.fp/data.003"
                ],
            "batch_size": 1
        },
        "numb_steps": 10000,
        "seed": 3982377700,
        "_comment": "that's all",
        "disp_file": "lcurve.out",
        "disp_freq": 100,
        "numb_test": 1,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
    }
}
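
A quick aside on the learning_rate block above: with "type": "exp", the learning rate decays geometrically every decay_steps steps, and the decay rate is derived so that the rate reaches stop_lr at the end of training. A small sketch of the arithmetic (the training log in Section 3.3 reports the same decay_rate, ~0.950006):

start_lr, stop_lr = 1.0e-3, 3.51e-8
decay_steps, numb_steps = 50, 10000

decay_rate = (stop_lr / start_lr) ** (decay_steps / numb_steps)
print(f"decay_rate = {decay_rate:.6f}")       # ~0.950006

def lr(step):
    return start_lr * decay_rate ** (step // decay_steps)

print(f"lr(100)   = {lr(100):.1e}")           # ~9.0e-04, matches lcurve.out below
print(f"lr(10000) = {lr(10000):.2e}")         # 3.51e-08 = stop_lr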

Compared with the dp_se_e2 model, DPA-1 uses se_atten as its descriptor, and the modified parameters are concentrated in the descriptor section:

"descriptor": {
    "type": "se_atten",
    "rcut_smth": 0.5,
    "rcut": 6.0,
    "sel": 60,
    "neuron": [25,50,100],
    "axis_neuron": 16,
    "resnet_dt": false,
    "attn": 128,
    "attn_layer": 2,
    "attn_mask": false,
    "attn_dotr": true,
    "seed": 1801819940,
    "_activation_function": "tanh"
},

Compared with the familiar se_e2_a descriptor, the following parameters differ:

  • type: "se_atten" selects the DPA-1 descriptor;
  • rcut: cutoff radius of the neighbor list; rcut_smth: where smoothing starts;
  • sel: the maximum total number of neighbor atoms considered. This value strongly affects DPA-1's training efficiency, so keep it modest; we recommend at most 200. In the DPA-1 paper, 120 was already enough to train on the OC2M dataset with its 56 elements. Unlike in dp_se_e2, where sel is a list, here it is a single int (see the neighbor-statistics example after the next list);
  • neuron: sizes of the embedding network;
  • axis_neuron: size of the submatrix of the embedding matrix, i.e. the axis matrix in the DeepPot-SE paper;
  • resnet_dt: if true, a timestep is used in the ResNet;
  • seed: random seed used when initializing model parameters.

In addition, several attention-related parameters are new:

  • attn: length of the hidden vectors in the attention step;
  • attn_layer: number of attention layers; we generally recommend 2;
  • attn_mask: whether to mask the diagonal of the attention weights;
  • attn_dotr: whether to multiply the attention weights by the dot products of the normalized relative coordinates, which acts like a gated attention mechanism.

The remaining parameters keep the same meaning as in the familiar se_e2_a descriptor; see the documentation for more detailed explanations.
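
To choose sel (and rcut) from data rather than guesswork, DeePMD-kit can report neighbor statistics for your systems; dp train also prints "max nbor size" automatically at startup, as you will see in the log in Section 3.3. An illustrative invocation (check dp neighbor-stat -h for the flags in your version):

! cd ./dpa && dp neighbor-stat -s iter.000000/02.fp -r 6.0 -t Li Ge P S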

Other notes:

For the rest of the model: DPA-1 currently supports only the "ener" type of fitting network, whose settings follow the standard fitting-network parameters.

Also, DPA-1 enables element type embedding by default, encoding element-related information and enlarging the model's capacity for element types. The default parameters are:

"type_embedding":{
            "neuron":           [2, 4, 8],
            "resnet_dt":        false,
            "seed":             1
        },

These parameters have the same meaning as in the standard type-embedding settings; to change the defaults, add a type_embedding block like the one above and customize it.

DPA-1 is particularly well suited to systems with many element types, especially ten or more. In that case you need to list the element corresponding to each type index by hand, i.e. the type_map parameter:

"type_map": [
    "Li",
    "Ge",
    "P",
    "S"
  ]
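
If you prefer to tweak these settings programmatically rather than by hand, here is a small convenience sketch (the paths and values are only illustrative):

import json

with open("./dpa/input.json") as f:
    cfg = json.load(f)

cfg["model"]["descriptor"]["attn_layer"] = 2       # e.g. use two attention layers
cfg["model"]["descriptor"]["sel"] = 120            # an int for se_atten, not a list
cfg["model"]["type_map"] = ["Li", "Ge", "P", "S"]

with open("./dpa/input_custom.json", "w") as f:    # keep the original file intact
    json.dump(cfg, f, indent=4)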

3.3 Model Training (from Scratch)

[4]
# Train the DPA model; on c12_m92_1 * NVIDIA V100 this takes about ten minutes~
! cd ./dpa/; dp train input.json; dp freeze -o dpa.pb
2023-07-22 16:55:05.785055: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 16:55:06.825439: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 16:55:06.825532: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 16:55:06.825552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO    training data with min nbor dist: 1.7120899465949608
DEEPMD INFO    training data with max nbor size: [56]
DEEPMD INFO     _____               _____   __  __  _____           _     _  _   
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |  
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_ 
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_ 
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install
DEEPMD INFO    source :              v2.2.0.b0-77-gc1299196
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        c1299196
DEEPMD INFO    source commit at:     2023-02-28 09:06:04 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build variant:        cpu
DEEPMD INFO    build with tf inc:    /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:    
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           bohrium-13387-1030178
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    Count of visible GPU: 1
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 4 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                    iter.000000/02.fp/data.000     400       1     127  0.241    T
DEEPMD INFO                    iter.000000/02.fp/data.001     400       1     131  0.248    T
DEEPMD INFO                    iter.000000/02.fp/data.002     400       1     133  0.252    T
DEEPMD INFO                    iter.000000/02.fp/data.003     400       1     137  0.259    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 8 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                    iter.000001/02.fp/data.000     400       1     117  0.111    T
DEEPMD INFO                    iter.000001/02.fp/data.001     400       1     137  0.129    T
DEEPMD INFO                    iter.000001/02.fp/data.002     400       1     137  0.129    T
DEEPMD INFO                    iter.000001/02.fp/data.003     400       1     138  0.130    T
DEEPMD INFO                    iter.000002/02.fp/data.000     400       1     133  0.126    T
DEEPMD INFO                    iter.000002/02.fp/data.001     400       1     133  0.126    T
DEEPMD INFO                    iter.000002/02.fp/data.002     400       1     134  0.127    T
DEEPMD INFO                    iter.000002/02.fp/data.003     400       1     129  0.122    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    data stating... (this step may take long time)
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO    batch     100 training time 8.37 s, testing time 0.05 s
DEEPMD INFO    batch     200 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     300 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     500 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     700 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     800 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1000 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch    1100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1200 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    1300 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1500 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1600 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1700 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    1800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    1900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    2000 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    2100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2200 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2300 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2400 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    2600 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    2700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    2800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    2900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3100 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    3200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3800 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    3900 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    4000 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    4100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4500 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    4600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4800 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch    4900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    5000 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    5100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    5300 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    5400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    5500 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    5600 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    5700 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    5800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    5900 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    6000 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    6100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    6200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6400 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7000 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7200 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    7300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7700 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    7800 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch    7900 training time 6.23 s, testing time 0.05 s
DEEPMD INFO    batch    8000 training time 6.24 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    8100 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    8200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    8500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    8600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    8700 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    8800 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    8900 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    9000 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    9100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    9200 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    9300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9600 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    9700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9800 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    9900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch   10000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    average training time: 0.0618 s/batch (exclude first 100 batches)
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 635.323 s
2023-07-22 17:05:57.559565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:05:58.573920: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:05:58.574019: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:05:58.574036: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO    The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam']
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
DEEPMD INFO    1334 ops in the final graph.

3.4 Fine-Tuning a Pretrained Model

[5]
# View the input file
! cat ./dpa_finetune/input.json
{
    "model": {
        "type_embedding":{"trainable": true},
        "descriptor": {"trainable": true},
        "fitting_net": {"trainable": true},
        "type_map": [
            "Li",
            "Ge",
            "P",
            "S"
        ]
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 50,
        "stop_lr": 3.51e-08
    },
    "loss": {
        "type": "ener",
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0,
        "limit_pref_v": 0
    },
    "training": {
        "training_data": {
            "systems": [
        		"iter.000000/02.fp/data.000",
                "iter.000000/02.fp/data.001",
                "iter.000000/02.fp/data.002",
                "iter.000000/02.fp/data.003"
            ],
            "batch_size": 1
        },
        "validation_data": {
            "systems": [
                "iter.000001/02.fp/data.000",
                "iter.000001/02.fp/data.001",
                "iter.000001/02.fp/data.002",
                "iter.000001/02.fp/data.003",
                "iter.000002/02.fp/data.000",
                "iter.000002/02.fp/data.001",
                "iter.000002/02.fp/data.002",
                "iter.000002/02.fp/data.003"
                ],
            "batch_size": 1
        },
        "numb_steps": 10000,
        "seed": 3982377700,
        "_comment": "that's all",
        "disp_file": "lcurve.out",
        "disp_freq": 100,
        "numb_test": 1,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
    }
}

In the fine-tuning input file, we only need to mark the type_embedding, descriptor, and fitting_net blocks as trainable; there is no need to write out their full contents again:

    "model": {
        "type_embedding":{"trainable": true},
        "descriptor": {"trainable": true},
        "fitting_net": {"trainable": true},

In this tutorial we simply reuse the dpa.pb we just trained as the example pretrained model and fine-tune from it.

The fine-tuning command adds the --finetune dpa.pb option.
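
Under the hood, before fine-tuning starts, DeePMD-kit re-fits the per-element energy bias on the new data by linear regression (visible in the log below: "Changing energy bias ... RMSE of atomic energy after linear regression"). Conceptually the step looks like this sketch; counts, e_ref, e_pred and old_bias are hypothetical stand-in arrays:

import numpy as np

rng = np.random.default_rng(0)
n_frames, n_types = 8, 4
counts = rng.integers(10, 30, size=(n_frames, n_types)).astype(float)  # atoms per element per frame
e_ref = rng.normal(size=n_frames)                  # stand-in DFT energies of the new data
e_pred = rng.normal(size=n_frames)                 # stand-in pretrained-model energies
old_bias = np.array([-4.17, -0.42, -0.83, -5.01])  # cf. the log below

# Solve counts @ delta ≈ e_ref - e_pred in the least-squares sense,
# then shift each element's energy bias accordingly
delta, *_ = np.linalg.lstsq(counts, e_ref - e_pred, rcond=1e-3)
print(old_bias + delta)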

[6]
# Fine-tune from dpa.pb; on c12_m92_1 * NVIDIA V100 this takes about ten minutes~
! cp ./dpa/dpa.pb ./dpa_finetune/ ; cd ./dpa_finetune/; dp train input.json --finetune dpa.pb ; dp freeze -o ./dpa_finetune.pb
2023-07-22 17:08:43.230841: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:08:44.271205: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:08:44.271303: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:08:44.271321: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
DEEPMD INFO    Change the model configurations according to the pretrained one...
DEEPMD INFO    Change the 'descriptor' from {'trainable': True} to {'type': 'se_atten', 'sel': 60, 'rcut_smth': 0.5, 'rcut': 6.0, 'neuron': [25, 50, 100], 'resnet_dt': False, 'axis_neuron': 16, 'attn': 128, 'attn_layer': 2, 'attn_dotr': True, 'attn_mask': False, 'seed': 1801819940, 'activation_function': 'tanh', 'type_one_side': False, 'precision': 'default', 'trainable': True, 'exclude_types': []}.
DEEPMD INFO    Change the 'fitting_net' from {'trainable': True} to {'neuron': [240, 240, 240], 'resnet_dt': True, 'seed': 2375417769, 'type': 'ener', 'numb_fparam': 0, 'numb_aparam': 0, 'activation_function': 'tanh', 'precision': 'default', 'trainable': True, 'rcond': 0.001, 'atom_ener': [], 'use_aparam_as_mask': False}.
/opt/conda/lib/python3.8/site-packages/deepmd/utils/compat.py:358: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO    training data with min nbor dist: 1.7120899465949608
DEEPMD INFO    training data with max nbor size: [56]
DEEPMD INFO     _____               _____   __  __  _____           _     _  _   
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |  
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_ 
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_ 
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-install
DEEPMD INFO    source :              v2.2.0.b0-77-gc1299196
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        c1299196
DEEPMD INFO    source commit at:     2023-02-28 09:06:04 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build variant:        cpu
DEEPMD INFO    build with tf inc:    /opt/conda/lib/python3.8/site-packages/tensorflow/include;/opt/conda/lib/python3.8/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:    
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           bohrium-13387-1030178
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    Count of visible GPU: 1
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 4 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                    iter.000000/02.fp/data.000     400       1     127  0.241    T
DEEPMD INFO                    iter.000000/02.fp/data.001     400       1     131  0.248    T
DEEPMD INFO                    iter.000000/02.fp/data.002     400       1     133  0.252    T
DEEPMD INFO                    iter.000000/02.fp/data.003     400       1     137  0.259    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 8 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                    iter.000001/02.fp/data.000     400       1     117  0.111    T
DEEPMD INFO                    iter.000001/02.fp/data.001     400       1     137  0.129    T
DEEPMD INFO                    iter.000001/02.fp/data.002     400       1     137  0.129    T
DEEPMD INFO                    iter.000001/02.fp/data.003     400       1     138  0.130    T
DEEPMD INFO                    iter.000002/02.fp/data.000     400       1     133  0.126    T
DEEPMD INFO                    iter.000002/02.fp/data.001     400       1     133  0.126    T
DEEPMD INFO                    iter.000002/02.fp/data.002     400       1     134  0.127    T
DEEPMD INFO                    iter.000002/02.fp/data.003     400       1     129  0.122    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    Changing energy bias in pretrained model for types ['Li', 'Ge', 'P', 'S']... (this step may take long time)
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
DEEPMD INFO    Adjust batch size from 1024 to 2048
DEEPMD INFO    Adjust batch size from 2048 to 4096
DEEPMD INFO    Adjust batch size from 4096 to 8192
DEEPMD INFO    RMSE of atomic energy after linear regression is: 0.0018085361367408837 eV/atom.
DEEPMD INFO    Change energy bias of ['Li', 'Ge', 'P', 'S'] from [-4.17483491 -0.41748349 -0.83496698 -5.00980189] to [-4.1760068  -0.41760068 -0.83520136 -5.01120816].
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO    initialize training from the frozen pretrained model
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 50, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO    batch     100 training time 8.34 s, testing time 0.05 s
DEEPMD INFO    batch     200 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     300 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     400 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     500 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     600 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch     700 training time 6.22 s, testing time 0.05 s
DEEPMD INFO    batch     800 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch     900 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    1000 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1200 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1300 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    1600 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1700 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1800 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    1900 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    2000 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    2100 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    2200 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    2300 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2400 training time 6.20 s, testing time 0.05 s
DEEPMD INFO    batch    2500 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2600 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    2700 training time 6.21 s, testing time 0.05 s
DEEPMD INFO    batch    2800 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    2900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3100 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    3200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    3900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    4000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    4100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    4300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4500 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    4600 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    4700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    4800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    4900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    5000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    5300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    5600 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    5700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    5800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    5900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    6100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6400 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    6700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    6900 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7000 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7100 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7200 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7300 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7500 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    7600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    7700 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7800 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    batch    7900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8000 training time 6.15 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    8100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8300 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8400 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8600 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8700 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    8800 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    8900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9000 training time 6.19 s, testing time 0.05 s
DEEPMD INFO    batch    9100 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9200 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9300 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    9400 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    batch    9500 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9600 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    9700 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch    9800 training time 6.17 s, testing time 0.05 s
DEEPMD INFO    batch    9900 training time 6.16 s, testing time 0.05 s
DEEPMD INFO    batch   10000 training time 6.18 s, testing time 0.05 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    average training time: 0.0617 s/batch (exclude first 100 batches)
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 634.868 s
2023-07-22 17:19:41.065653: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:19:42.102660: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:19:42.102767: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:19:42.102785: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
DEEPMD INFO    The following nodes will be frozen: ['model_type', 'descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'train_attr/min_nbor_dist', 'train_attr/training_script', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam']
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/entrypoints/freeze.py:354: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/convert_to_constants.py:943: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
DEEPMD INFO    1336 ops in the final graph.

And that's it: a potential fine-tuned from an existing pretrained model. To summarize, it differs from ordinary training in:

  • the parameter settings in input.json
  • adding --finetune pretrained.pb on the command line

3.5 Model Validation

We have now trained the solid-state electrolyte DPA potential both from scratch and by fine-tuning. Let's check how the models perform using lcurve.out (the learning-curve output file) and the dp test command!

📌 Note: to save time, the training in this tutorial was not run to convergence; in real applications, remember to train for an appropriate length of time~
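
For example, you can evaluate the frozen model against one of the validation systems with dp test (an illustrative invocation; check dp test -h for the flags in your version):

! cd ./dpa && dp test -m dpa.pb -s iter.000002/02.fp/data.000 -n 100 -d test_results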

[7]
# View the dpa/lcurve.out results
! cat ./dpa/lcurve.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.77e+01    2.71e+01      2.01e+00    2.03e+00      8.58e-01    8.36e-01    1.0e-03
    100      1.83e+01    1.62e+01      8.53e-02    5.69e-02      6.07e-01    5.38e-01    9.0e-04
    200      9.90e+00    9.35e+00      1.78e-01    1.78e-01      3.42e-01    3.23e-01    8.1e-04
    300      7.22e+00    6.22e+00      9.65e-02    1.05e-01      2.64e-01    2.26e-01    7.4e-04
    400      6.61e+00    5.67e+00      7.50e-03    1.31e-02      2.57e-01    2.20e-01    6.6e-04
    500      5.75e+00    5.45e+00      5.07e-02    4.98e-02      2.34e-01    2.21e-01    6.0e-04
    600      5.83e+00    4.90e+00      1.72e-03    3.21e-03      2.51e-01    2.11e-01    5.4e-04
    700      3.94e+00    4.28e+00      2.77e-02    3.05e-02      1.77e-01    1.93e-01    4.9e-04
    800      4.64e+00    4.28e+00      6.33e-02    6.20e-02      2.16e-01    1.99e-01    4.4e-04
    900      4.17e+00    4.59e+00      2.14e-02    1.86e-02      2.09e-01    2.30e-01    4.0e-04
   1000      4.24e+00    3.72e+00      2.79e-02    2.48e-02      2.22e-01    1.95e-01    3.6e-04
   1100      3.41e+00    3.72e+00      4.79e-02    4.03e-02      1.84e-01    2.03e-01    3.2e-04
   1200      3.17e+00    3.45e+00      1.29e-02    1.99e-02      1.85e-01    2.00e-01    2.9e-04
   1300      3.46e+00    3.29e+00      8.73e-02    7.90e-02      1.92e-01    1.84e-01    2.6e-04
   1400      3.06e+00    3.19e+00      5.29e-02    4.67e-02      1.89e-01    2.00e-01    2.4e-04
   1500      2.89e+00    2.60e+00      7.33e-03    9.06e-03      1.96e-01    1.77e-01    2.1e-04
   1600      2.75e+00    2.56e+00      5.06e-03    1.38e-04      1.97e-01    1.84e-01    1.9e-04
   1700      2.44e+00    2.52e+00      1.66e-02    1.52e-02      1.82e-01    1.89e-01    1.7e-04
   1800      1.87e+00    2.47e+00      2.41e-02    7.82e-03      1.44e-01    1.96e-01    1.6e-04
   1900      2.24e+00    2.21e+00      1.16e-02    1.34e-02      1.86e-01    1.84e-01    1.4e-04
   2000      2.03e+00    1.88e+00      1.75e-02    2.10e-02      1.76e-01    1.62e-01    1.3e-04
   2100      1.95e+00    1.91e+00      1.21e-02    1.50e-02      1.79e-01    1.75e-01    1.2e-04
   2200      1.89e+00    1.99e+00      1.06e-02    1.47e-02      1.83e-01    1.92e-01    1.0e-04
   2300      1.81e+00    1.78e+00      9.27e-03    6.88e-03      1.85e-01    1.82e-01    9.4e-05
   2400      1.63e+00    1.65e+00      1.87e-02    1.64e-02      1.72e-01    1.75e-01    8.5e-05
   2500      1.76e+00    1.67e+00      1.97e-02    2.27e-02      1.95e-01    1.82e-01    7.7e-05
   2600      1.50e+00    1.51e+00      9.35e-03    6.31e-03      1.78e-01    1.79e-01    6.9e-05
   2700      1.47e+00    1.22e+00      1.93e-03    4.47e-03      1.84e-01    1.53e-01    6.3e-05
   2800      1.44e+00    1.44e+00      1.51e-02    1.35e-02      1.86e-01    1.87e-01    5.7e-05
   2900      1.29e+00    1.42e+00      1.09e-02    8.51e-03      1.77e-01    1.96e-01    5.1e-05
   3000      1.30e+00    1.28e+00      8.11e-03    7.30e-03      1.88e-01    1.85e-01    4.6e-05
   3100      1.07e+00    1.21e+00      3.57e-03    6.98e-03      1.64e-01    1.84e-01    4.2e-05
   3200      1.12e+00    1.07e+00      4.16e-03    2.35e-03      1.80e-01    1.73e-01    3.8e-05
   3300      1.22e+00    1.05e+00      2.06e-03    3.05e-03      2.07e-01    1.78e-01    3.4e-05
   3400      1.04e+00    1.00e+00      1.11e-02    5.72e-03      1.82e-01    1.77e-01    3.1e-05
   3500      8.83e-01    9.68e-01      4.07e-03    4.30e-03      1.65e-01    1.80e-01    2.8e-05
   3600      9.95e-01    9.55e-01      1.16e-02    7.57e-03      1.90e-01    1.85e-01    2.5e-05
   3700      8.91e-01    8.59e-01      9.37e-03    1.26e-02      1.80e-01    1.70e-01    2.2e-05
   3800      9.13e-01    8.02e-01      9.55e-04    3.18e-03      1.98e-01    1.73e-01    2.0e-05
   3900      8.30e-01    7.84e-01      8.00e-03    4.25e-03      1.85e-01    1.77e-01    1.8e-05
   4000      7.63e-01    8.00e-01      1.97e-03    1.83e-03      1.82e-01    1.91e-01    1.7e-05
   4100      7.43e-01    7.65e-01      8.91e-03    1.14e-02      1.81e-01    1.83e-01    1.5e-05
   4200      7.29e-01    6.09e-01      1.04e-02    7.09e-03      1.84e-01    1.56e-01    1.3e-05
   4300      7.21e-01    7.04e-01      1.78e-03    6.26e-03      1.99e-01    1.91e-01    1.2e-05
   4400      6.60e-01    6.26e-01      1.30e-03    5.52e-03      1.91e-01    1.78e-01    1.1e-05
   4500      6.25e-01    5.83e-01      6.60e-03    3.96e-03      1.85e-01    1.75e-01    9.9e-06
   4600      5.86e-01    5.10e-01      3.30e-04    5.71e-03      1.86e-01    1.58e-01    8.9e-06
   4700      5.21e-01    5.50e-01      6.89e-03    5.96e-03      1.67e-01    1.78e-01    8.1e-06
   4800      5.68e-01    5.41e-01      3.52e-03    1.32e-03      1.96e-01    1.88e-01    7.3e-06
   4900      4.90e-01    5.28e-01      4.22e-03    6.94e-03      1.76e-01    1.86e-01    6.6e-06
   5000      4.88e-01    4.61e-01      2.72e-03    1.98e-03      1.84e-01    1.75e-01    5.9e-06
   5100      4.75e-01    4.26e-01      6.41e-04    2.15e-03      1.88e-01    1.68e-01    5.3e-06
   5200      4.48e-01    4.38e-01      5.80e-03    1.03e-02      1.80e-01    1.61e-01    4.8e-06
   5300      4.53e-01    4.28e-01      6.71e-03    1.81e-03      1.87e-01    1.85e-01    4.4e-06
   5400      4.24e-01    4.13e-01      2.37e-03    8.79e-04      1.90e-01    1.86e-01    3.9e-06
   5500      4.34e-01    4.11e-01      4.81e-04    3.49e-03      2.04e-01    1.90e-01    3.5e-06
   5600      3.85e-01    3.41e-01      2.10e-04    4.40e-03      1.88e-01    1.61e-01    3.2e-06
   5700      3.46e-01    3.53e-01      3.73e-03    3.43e-03      1.72e-01    1.76e-01    2.9e-06
   5800      3.55e-01    3.47e-01      2.80e-03    5.82e-04      1.85e-01    1.83e-01    2.6e-06
   5900      3.28e-01    3.29e-01      1.18e-03    1.85e-03      1.79e-01    1.79e-01    2.4e-06
   6000      3.09e-01    3.30e-01      3.07e-03    1.18e-03      1.71e-01    1.86e-01    2.1e-06
   6100      3.23e-01    3.17e-01      6.21e-03    2.25e-03      1.75e-01    1.84e-01    1.9e-06
   6200      3.14e-01    3.30e-01      3.57e-04    3.39e-03      1.90e-01    1.96e-01    1.7e-06
   6300      3.07e-01    3.15e-01      9.20e-04    3.24e-03      1.92e-01    1.93e-01    1.6e-06
   6400      2.97e-01    2.99e-01      1.34e-03    5.33e-03      1.91e-01    1.80e-01    1.4e-06
   6500      2.72e-01    2.96e-01      2.25e-03    6.45e-03      1.78e-01    1.77e-01    1.3e-06
   6600      2.73e-01    2.93e-01      1.84e-03    2.25e-03      1.85e-01    1.98e-01    1.1e-06
   6700      2.86e-01    2.77e-01      4.48e-03    5.85e-03      1.90e-01    1.76e-01    1.0e-06
   6800      2.56e-01    2.75e-01      1.36e-03    2.78e-03      1.83e-01    1.93e-01    9.3e-07
   6900      2.60e-01    2.64e-01      2.62e-03    2.45e-03      1.88e-01    1.91e-01    8.4e-07
   7000      2.40e-01    2.35e-01      3.99e-03    4.34e-03      1.71e-01    1.64e-01    7.6e-07
   7100      2.40e-01    2.54e-01      1.74e-03    3.09e-03      1.83e-01    1.90e-01    6.9e-07
   7200      2.18e-01    2.19e-01      1.32e-03    1.63e-03      1.70e-01    1.70e-01    6.2e-07
   7300      2.31e-01    2.30e-01      3.02e-03    2.25e-03      1.79e-01    1.80e-01    5.6e-07
   7400      2.31e-01    2.41e-01      3.03e-03    1.80e-03      1.82e-01    1.94e-01    5.1e-07
   7500      2.24e-01    2.41e-01      1.50e-04    2.81e-03      1.86e-01    1.94e-01    4.6e-07
   7600      2.29e-01    2.35e-01      1.44e-03    1.79e-03      1.91e-01    1.96e-01    4.1e-07
   7700      2.11e-01    1.94e-01      6.94e-04    2.66e-03      1.80e-01    1.59e-01    3.7e-07
   7800      2.09e-01    2.20e-01      2.60e-03    1.23e-03      1.75e-01    1.89e-01    3.4e-07
   7900      2.20e-01    1.96e-01      3.74e-03    3.86e-03      1.82e-01    1.58e-01    3.0e-07
   8000      2.14e-01    2.81e-01      3.20e-03    1.18e-02      1.81e-01    1.35e-01    2.7e-07
   8100      2.14e-01    2.14e-01      3.17e-03    8.66e-05      1.83e-01    1.92e-01    2.5e-07
   8200      2.46e-01    2.11e-01      7.13e-03    4.51e-04      1.81e-01    1.91e-01    2.2e-07
   8300      2.13e-01    2.05e-01      2.08e-03    2.95e-03      1.90e-01    1.79e-01    2.0e-07
   8400      2.06e-01    1.96e-01      3.87e-03    2.31e-04      1.76e-01    1.80e-01    1.8e-07
   8500      1.93e-01    2.19e-01      8.70e-04    2.88e-03      1.78e-01    1.96e-01    1.6e-07
   8600      2.18e-01    2.14e-01      2.43e-03    8.26e-04      1.98e-01    1.99e-01    1.5e-07
   8700      2.12e-01    2.06e-01      2.30e-03    3.80e-03      1.95e-01    1.80e-01    1.3e-07
   8800      2.31e-01    2.02e-01      6.19e-03    3.20e-03      1.84e-01    1.81e-01    1.2e-07
   8900      1.93e-01    1.79e-01      2.89e-03    4.03e-04      1.75e-01    1.70e-01    1.1e-07
   9000      1.82e-01    1.93e-01      9.33e-04    1.00e-03      1.73e-01    1.83e-01    9.8e-08
   9100      1.83e-01    2.22e-01      1.07e-03    2.42e-03      1.74e-01    2.07e-01    8.8e-08
   9200      1.93e-01    1.95e-01      3.45e-03    2.74e-03      1.73e-01    1.80e-01    8.0e-08
   9300      2.07e-01    1.89e-01      4.09e-03    2.20e-05      1.83e-01    1.83e-01    7.2e-08
   9400      2.10e-01    2.30e-01      1.47e-04    3.88e-03      2.03e-01    2.10e-01    6.5e-08
   9500      2.28e-01    1.81e-01      6.78e-03    1.27e-03      1.78e-01    1.74e-01    5.9e-08
   9600      2.00e-01    1.93e-01      1.75e-03    4.21e-03      1.92e-01    1.69e-01    5.3e-08
   9700      1.90e-01    1.69e-01      2.14e-03    2.24e-03      1.81e-01    1.59e-01    4.8e-08
   9800      1.85e-01    2.30e-01      1.29e-04    5.23e-03      1.81e-01    2.01e-01    4.3e-08
   9900      1.88e-01    1.97e-01      3.34e-03    2.92e-03      1.73e-01    1.85e-01    3.9e-08
  10000      1.79e-01    2.07e-01      1.66e-03    3.54e-03      1.73e-01    1.91e-01    3.5e-08

Want an at-a-glance comparison of the two models' learning curves? Run the ready-made visualization script below. Each `lcurve.out` records, step by step, the training/validation RMSEs of the total loss, the energy, and the force, together with the current learning rate.

[2]
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import glob

def plot_ax():
    """Create a figure with uniformly styled axes."""
    fig, ax = plt.subplots(constrained_layout=True, figsize=(12 / 2.54, 9 / 2.54))
    ax.tick_params(width=1)
    for side in ['left', 'right', 'bottom', 'top']:
        ax.spines[side].set_linewidth(1)
    return ax

def plt_lcurve(input_pt: str = './', val: bool = False, name: str = 'lcurve.png'):
    """Plot a learning curve from lcurve.out.
    Input: path to lcurve.out; whether validation data were used.
    Output: the figure is shown and saved to disk.
    """
    lcurve_pt = input_pt
    ax = plot_ax()
    # The first line of lcurve.out is a commented header naming the columns
    with open(lcurve_pt) as f:
        headers = f.readline().split()[1:]
    lcurve = pd.DataFrame(np.loadtxt(lcurve_pt), columns=headers)
    if val:
        legends = ["rmse_e_trn", "rmse_e_val", "rmse_f_trn", "rmse_f_val", "lr"]
    else:
        legends = ["rmse_e_trn", "rmse_f_trn", "lr"]
    for legend in legends:
        ax.loglog(lcurve["step"], lcurve[legend], label=legend)
    ax.legend()
    ax.set_xlabel("Training steps")
    ax.set_ylim(1e-10, 1e0)
    ax.set_ylabel("Loss")
    ax.set_title("{}".format(name))
    plt.savefig('./{}'.format(name), dpi=300)
    plt.show()

# Plot the learning curve of each run; the pattern matches one directory level,
# which is all we need for dpa/ and dpa_finetune/
lcurve = glob.glob('**/lcurve.out')
print(lcurve)

for data in lcurve:
    plt_lcurve(data, val=True, name='{}_lcurve.png'.format(data.split('/')[0]))
['dpa/lcurve.out', 'dpa_finetune/lcurve.out']

Comparing the two learning curves, the fine-tuned model, which starts from the pretrained model, begins training with noticeably lower energy and force losses.

Next, let's quantify how well the predicted data correlate with the reference data and inspect the result visually.

[9]
# Use dp test to evaluate the energy and force errors of the dpa and dpa_finetune models
# (run `dp test -h` if you are not yet familiar with the dp test options)
! cd ./dpa_finetune/; dp test -m dpa.pb -s ./iter.000001/02.fp -d results_dpa
! cd ./dpa_finetune/; dp test -m dpa_finetune.pb -s ./iter.000001/02.fp -d results_finetune
2023-07-22 17:21:20.104037: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:21:21.129529: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:21:21.129622: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:21:21.129639: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.000
DEEPMD INFO    Adjust batch size from 1024 to 2048
DEEPMD INFO    Adjust batch size from 2048 to 4096
DEEPMD INFO    Adjust batch size from 4096 to 8192
DEEPMD INFO    Adjust batch size from 8192 to 16384
DEEPMD INFO    Adjust batch size from 16384 to 32768
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 9.427785e-01 eV
DEEPMD INFO    Energy RMSE        : 1.196250e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.356946e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.990626e-03 eV
DEEPMD INFO    Force  MAE         : 1.336536e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.815210e-01 eV/A
DEEPMD INFO    Virial MAE         : 9.645088e+00 eV
DEEPMD INFO    Virial RMSE        : 1.431580e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.411272e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 3.578950e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.001
2023-07-22 17:21:42.161708: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize/Square
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:21:42.161959: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_*
2023-07-22 17:21:42.163651: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory
2023-07-22 17:21:52.165476: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize_1/Square
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:21:52.165726: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_*
2023-07-22 17:21:52.165750: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory
2023-07-22 17:22:02.165943: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize_2/Square
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:22:02.166201: W tensorflow/tsl/framework/bfc_allocator.cc:492] ********************************************___*************************_*************************_*
2023-07-22 17:22:02.166226: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory
DEEPMD INFO    Adjust batch size from 32768 to 16384
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 8.434060e-01 eV
DEEPMD INFO    Energy RMSE        : 1.056923e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.108515e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.642309e-03 eV
DEEPMD INFO    Force  MAE         : 1.342740e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.823510e-01 eV/A
DEEPMD INFO    Virial MAE         : 9.380907e+00 eV
DEEPMD INFO    Virial RMSE        : 1.377223e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.345227e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 3.443056e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.002
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 1.096916e+00 eV
DEEPMD INFO    Energy RMSE        : 1.411227e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.742291e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 3.528067e-03 eV
DEEPMD INFO    Force  MAE         : 1.328904e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.803824e-01 eV/A
DEEPMD INFO    Virial MAE         : 8.762420e+00 eV
DEEPMD INFO    Virial RMSE        : 1.282206e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.190605e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 3.205516e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.003
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 1.005292e+00 eV
DEEPMD INFO    Energy RMSE        : 1.240209e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.513231e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 3.100522e-03 eV
DEEPMD INFO    Force  MAE         : 1.341954e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.820485e-01 eV/A
DEEPMD INFO    Virial MAE         : 8.224030e+00 eV
DEEPMD INFO    Virial RMSE        : 1.197917e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.056007e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 2.994793e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ----------weighted average of errors----------- 
DEEPMD INFO    # number of systems : 4
DEEPMD INFO    Energy MAE         : 9.720983e-01 eV
DEEPMD INFO    Energy RMSE        : 1.232658e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.430246e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 3.081644e-03 eV
DEEPMD INFO    Force  MAE         : 1.337533e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.815773e-01 eV/A
DEEPMD INFO    Virial MAE         : 9.003111e+00 eV
DEEPMD INFO    Virial RMSE        : 1.325257e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.250778e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 3.313142e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
2023-07-22 17:22:21.320361: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-22 17:22:22.353270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:22:22.353362: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/mpi/gcc/openmpi-4.1.0rc5/lib:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-07-22 17:22:22.353379: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.000
DEEPMD INFO    Adjust batch size from 1024 to 2048
DEEPMD INFO    Adjust batch size from 2048 to 4096
DEEPMD INFO    Adjust batch size from 4096 to 8192
DEEPMD INFO    Adjust batch size from 8192 to 16384
DEEPMD INFO    Adjust batch size from 16384 to 32768
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 8.621734e-01 eV
DEEPMD INFO    Energy RMSE        : 1.057169e+00 eV
DEEPMD INFO    Energy MAE/Natoms  : 2.155433e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.642923e-03 eV
DEEPMD INFO    Force  MAE         : 1.101241e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.489321e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.266581e+01 eV
DEEPMD INFO    Virial RMSE        : 1.956447e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 3.166453e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.891117e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.001
2023-07-22 17:22:43.410028: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/c_value/MatMul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:22:43.410321: W tensorflow/tsl/framework/bfc_allocator.cc:492] *******************************************____*************************___******************_****_*
2023-07-22 17:22:43.410369: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at matmul_op_impl.h:731 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1944000,128] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-07-22 17:22:53.412286: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize/Square
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:22:53.412596: W tensorflow/tsl/framework/bfc_allocator.cc:492] *******************************************____*************************___******************_****_*
2023-07-22 17:22:53.412637: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory
2023-07-22 17:23:03.412833: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.85GiB (rounded to 1990656000)requested by op load/attention_layer_1/l2_normalize_1/Square
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-07-22 17:23:03.413122: W tensorflow/tsl/framework/bfc_allocator.cc:492] *******************************************____*************************___******************_****_*
2023-07-22 17:23:03.413148: W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory
DEEPMD INFO    Adjust batch size from 32768 to 16384
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 6.472647e-01 eV
DEEPMD INFO    Energy RMSE        : 8.226297e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.618162e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.056574e-03 eV
DEEPMD INFO    Force  MAE         : 1.106034e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.492427e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.168436e+01 eV
DEEPMD INFO    Virial RMSE        : 1.808673e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.921089e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.521683e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.002
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 6.905101e-01 eV
DEEPMD INFO    Energy RMSE        : 8.749119e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.726275e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.187280e-03 eV
DEEPMD INFO    Force  MAE         : 1.094928e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.477969e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.099581e+01 eV
DEEPMD INFO    Virial RMSE        : 1.688399e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.748951e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.220997e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ---------------output of dp test--------------- 
DEEPMD INFO    # testing system : iter.000001/02.fp/data.003
DEEPMD INFO    # number of test data : 100 
DEEPMD INFO    Energy MAE         : 6.999822e-01 eV
DEEPMD INFO    Energy RMSE        : 8.619741e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.749956e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.154935e-03 eV
DEEPMD INFO    Force  MAE         : 1.101929e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.488093e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.067116e+01 eV
DEEPMD INFO    Virial RMSE        : 1.624072e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.667789e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.060179e-02 eV
DEEPMD INFO    # ----------------------------------------------- 
DEEPMD INFO    # ----------weighted average of errors----------- 
DEEPMD INFO    # number of systems : 4
DEEPMD INFO    Energy MAE         : 7.249826e-01 eV
DEEPMD INFO    Energy RMSE        : 9.086799e-01 eV
DEEPMD INFO    Energy MAE/Natoms  : 1.812456e-03 eV
DEEPMD INFO    Energy RMSE/Natoms : 2.271700e-03 eV
DEEPMD INFO    Force  MAE         : 1.101033e-01 eV/A
DEEPMD INFO    Force  RMSE        : 1.486962e-01 eV/A
DEEPMD INFO    Virial MAE         : 1.150428e+01 eV
DEEPMD INFO    Virial RMSE        : 1.773928e+01 eV
DEEPMD INFO    Virial MAE/Natoms  : 2.876071e-02 eV
DEEPMD INFO    Virial RMSE/Natoms : 4.434820e-02 eV
DEEPMD INFO    # ----------------------------------------------- 

Likewise, let's check each model's accuracy with a visualization.

[10]
# Define a helper that draws a parity scatter plot with a y = x reference line
def plot(ax, data, key, xlabel, ylabel, min_val, max_val, RMSE):
    data_key = f'data_{key}'
    pred_key = f'pred_{key}'
    ax.scatter(data[data_key], data[pred_key], label=key, s=6)
    if not isinstance(RMSE, list):
        ax.text(0.1, 0.8, 'RMSE: {:.4e}'.format(RMSE), transform=ax.transAxes, color='blue', size=14)
    else:
        ax.text(0.05, 0.6, 'RMSE: \nfx:{:.4e}\nfy:{:.4e}\nfz:{:.4e}'.format(RMSE[0], RMSE[1], RMSE[2]),
                transform=ax.transAxes, color='blue', size=12)
    ax.legend()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_xlim(min_val, max_val)
    ax.set_ylim(min_val, max_val)
    ax.plot([min_val, max_val], [min_val, max_val], 'r', lw=1)


# Models tested with dp test above
num_list = ['dpa', 'finetune']

# Read the per-atom energy and force test results for each model
for num in num_list:
    print("Processing file: results_{}.e_peratom.out".format(num))
    data_e = np.genfromtxt("./dpa_finetune/results_{}.e_peratom.out".format(num),
                           names=["data_e", "pred_e"])
    data_f = np.genfromtxt("./dpa_finetune/results_{}.f.out".format(num),
                           names=["data_fx", "data_fy", "data_fz", "pred_fx", "pred_fy", "pred_fz"])

    # Compute the RMSE of energies and forces: ||pred - ref||_2 / sqrt(N)
    RMSE_e = np.linalg.norm(data_e['data_e'] - data_e['pred_e'], ord=2) / len(data_e['data_e']) ** 0.5
    RMSE_f = []
    for j in ['fx', 'fy', 'fz']:
        RMSE_f.append(np.linalg.norm(data_f['data_{}'.format(j)] - data_f['pred_{}'.format(j)], ord=2)
                      / len(data_f['data_{}'.format(j)]) ** 0.5)

    # Determine shared axis limits from the min/max of energies and forces
    data_e_stacked = np.column_stack((data_e['data_e'], data_e['pred_e']))
    data_f_stacked = np.column_stack((data_f['data_fx'], data_f['data_fy'], data_f['data_fz'],
                                      data_f['pred_fx'], data_f['pred_fy'], data_f['pred_fz']))
    min_val_e, max_val_e = np.min(data_e_stacked), np.max(data_e_stacked)
    min_val_f, max_val_f = np.min(data_f_stacked), np.max(data_f_stacked)

    # Draw the scatter plots and save the figure
    fig, axs = plt.subplots(1, 2, figsize=(12, 5))
    plot(axs[0], data_e, 'e', 'DFT energy (eV/atom)', 'DP energy (eV/atom)', min_val_e, max_val_e, RMSE_e)
    for force_direction in ['fx', 'fy', 'fz']:
        plot(axs[1], data_f, force_direction, 'DFT force (eV/Å)', 'DP force (eV/Å)', min_val_f, max_val_f, RMSE_f)
    print('Saved to: ./dp_test_{}.png'.format(num))
    plt.savefig('./dp_test_{}.png'.format(num), dpi=300)
Processing file: results_dpa.e_peratom.out
Saved to: ./dp_test_dpa.png
Processing file: results_finetune.e_peratom.out
Saved to: ./dp_test_finetune.png

Judging from the weighted-average Energy RMSE/Natoms and Force RMSE above, the fine-tuned model is more accurate than the from-scratch model under identical conditions: the per-atom energy RMSE drops from 3.08e-03 eV to 2.27e-03 eV, and the force RMSE from 1.82e-01 eV/Å to 1.49e-01 eV/Å.
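
As a quick cross-check, we can put the two weighted averages reported by `dp test` side by side. This snippet is just arithmetic on numbers copied from the logs above:

```python
# Weighted-average test errors copied from the dp test outputs above
dpa      = {"Energy RMSE/Natoms (eV)": 3.081644e-03, "Force RMSE (eV/A)": 1.815773e-01}
finetune = {"Energy RMSE/Natoms (eV)": 2.271700e-03, "Force RMSE (eV/A)": 1.486962e-01}

for key in dpa:
    gain = (dpa[key] - finetune[key]) / dpa[key] * 100
    print(f"{key}: {dpa[key]:.3e} -> {finetune[key]:.3e} ({gain:.0f}% lower)")
```

That is roughly a 26% reduction in per-atom energy RMSE and an 18% reduction in force RMSE from fine-tuning alone.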


That wraps up this guide. So, itching to put the DPA-1 pretrained model to work? Go apply what you have learned to your own research project~


That day, your advisor is full of praise for your model-training strategy 👍.

And then another question occurs to you: now that the model is trained, how do I actually use it to run molecular dynamics simulations?

ddd! Welcome to read the follow-up guide:
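
Until then, here is a minimal taste of using the frozen model programmatically via DeePMD-kit's `DeepPot` inference interface. This is only an illustrative sketch: the coordinates, cell, and type indices below are placeholders, not data from this tutorial.

```python
# A minimal inference sketch (illustrative placeholders, not tutorial data)
import numpy as np
from deepmd.infer import DeepPot

dp = DeepPot("dpa_finetune.pb")            # load the frozen fine-tuned model
coord = np.random.rand(1, 4 * 3)           # one frame, 4 atoms, flattened xyz (placeholder)
cell = (10.0 * np.eye(3)).reshape(1, 9)    # 10 Å cubic box (placeholder)
atype = [0, 0, 1, 2]                       # type indices following the model's type_map
e, f, v = dp.eval(coord, cell, atype)      # predicted energy, forces, virial
print(e.shape, f.shape, v.shape)
```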
