DPA2 Hands-on Tutorial
DPA
Tutorial
DPATutorial
2043899742@qq.com
Published 2023-11-28
Recommended image: dpa2-tutorial:v3
Recommended machine type: c4_m15_1 * NVIDIA T4
DPA2 Hands-on Tutorial
1. Learning Goals
2. Introduction to DPA-2
2.1 Background
2.2 Methods
2.3 Experimental Validation
3. DPA2 Hands-on
3.1 Overview
3.2 Preparing the Input Script
3.3 Model Training (from scratch)
3.4 Model Fine-tuning
3.5 Model Validation
4. References

©️ Copyright 2023 @ Authors
Author: 张成谦 📨
Date: 2023-11-30
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the blue 开始连接 (Start Connecting) button at the top of the page, select the `dpa2-tutorial:v3` image and the `c4_m15_1 * NVIDIA T4` machine configuration, and after a short wait the notebook will be ready to run.

💭 Before reading this notebook, we recommend working through the DeePMD and DPA-1 tutorials first; this notebook does not re-explain the meaning of every parameter in the input files:

  1. Columbus Bootcamp | DPA-1: Solid-State Electrolytes in Practice — Model Training & Property Calculation
  2. Quick Start DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane

1. Learning Goals

After completing this tutorial, you will:

  1. Understand the basic principles and applications of DPA-2;
  2. Train a DPA-2 potential on the H2O-SCAN0 dataset, hands-on: read through the input script, and compare training from scratch vs. fine-tuning an existing pretrained model.

2. Introduction to DPA-2

👂 Eager to get hands-on? Jump straight to Section 3~

2.1 Background

Potential-energy model development has always pursued a balance between accuracy and efficiency. Classical force fields are fast and convenient, but their accuracy is hard to push further; ab initio molecular dynamics (AIMD), popular in recent years, delivers much higher accuracy, but its computational cost makes large systems and long timescales impractical. With the development of AI for Science, machine learning has made it possible to train potentials that are both accurate and efficient (figure below: comparison of molecular dynamics approaches). In the new MLMD paradigm, quantum-mechanical (QM) calculations are no longer used to drive AIMD directly; instead they serve to prepare the dataset for training a machine-learned potential (MLP). Of course, AIMD results can also be used as an initial dataset.

[Figure: comparison of molecular dynamics simulation approaches]

ref. Machine learning-accelerated quantum mechanics-based atomistic simulations for industrial applications

However, because existing models transfer poorly and no general-purpose large model exists, scientists facing a new, complex system still essentially have to collect large amounts of computed data and train a model from scratch to obtain a usable, reasonably complete potential. As electronic-structure data accumulates, and by analogy with the development of other AI fields such as computer vision (CV) and natural language processing (NLP), **"pretraining + fine-tuning on a small amount of data"** is the natural way to address this problem.

To realize this paradigm, we need a model architecture with strong transferability that can accommodate most elements of the periodic table.

2.2 Methods

DPA-2 is another comprehensive upgrade of the DP model family, following DPA-1:

First, DPA-2 uses a multi-task training strategy: the model can be pretrained simultaneously on multiple datasets labelled with different DFT settings. When fine-tuning (finetune) for a downstream task, the model backbone (the part that encodes representations of configurational and chemical space) is retained and connected to one or more head networks, so the labelling methods of the pretraining and fine-tuning datasets need not be identical. A fine-tuned model has a large number of parameters, which can make it inefficient when applied directly in production (e.g. MD simulation); to solve this, the model can be distilled (distillation) into one with fewer parameters that keeps the downstream accuracy while running faster, enabling large-scale simulation. (Figure 1: workflow of pretraining, fine-tuning, and distillation)

[Figure 1: workflow of pretraining, fine-tuning, and distillation]
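The multi-task idea can be sketched in a few lines of toy Python. This is a conceptual illustration only, not the real DPA-2 implementation: the array shapes and the branch names in `heads` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone: encodes an atomic environment into a common representation.
W_backbone = rng.normal(size=(8, 16))

# One small fitting head per pretraining dataset; their labelling schemes may differ.
heads = {
    "H2O_H2O-PD": rng.normal(size=(16,)),
    "Alloy":      rng.normal(size=(16,)),
}

def predict_energy(env, branch):
    """Run the shared backbone, then the dataset-specific fitting head."""
    features = np.tanh(env @ W_backbone)    # the backbone is reused by every task
    return float(features @ heads[branch])  # only the head differs per dataset

env = rng.normal(size=(8,))
e_water = predict_energy(env, "H2O_H2O-PD")
e_alloy = predict_energy(env, "Alloy")
# When fine-tuning, the backbone weights are kept and a head is (re)trained.
```

The same environment gives different predictions per branch because only the head changes; this is why the fine-tuning command later in the tutorial can pick a specific head with `-m`.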

Second, DPA-2 further improves the model architecture to capture interatomic interactions more fully. Pretrained on datasets spanning alloys, semiconductors, battery materials, and drug-like molecules, it learns more of the hidden atomic-interaction information and transfers far better across datasets with different conformations and compositions. We pretrained the model on 18 different datasets and transferred it to a variety of downstream tasks; experiments show that, compared with DPA-1, the pretrained DPA-2 model further reduces the amount of data and the training cost required for downstream tasks while improving prediction accuracy. (Figure 2: schematic of the DPA-2 model architecture)

[Figure 2: schematic of the DPA-2 model architecture]

2.3 Experimental Validation

Dataset overview

[Figure: overview of the pretraining datasets]

We visualized the representation (descriptor) of each dataset in the pretrained model using t-SNE; the result is shown below:

[Figure: t-SNE visualization of dataset representations in the pretrained model]

Downstream fine-tuning

The researchers ran a sample-efficiency test on DPA-1, DPA-2, and the multitask-pretrained DPA-2, tracking how the converged energy and force RMSE change as the amount of downstream data grows (Figure 3: results on the downstream datasets).

The figure shows that:

1. A DPA-2 model trained from scratch converges to better accuracy than DPA-1, especially when the dataset is large enough, highlighting the advantage of the DPA-2 architecture.

2. With multitask pretraining, the fine-tuned DPA-2 model can yield a much lower curve than a from-scratch DPA-2, especially when downstream data is limited. On datasets such as H2O-SCAN0, even the zero-shot RMSE is already sufficiently accurate.

[Figure 3: results on the downstream datasets]

3. DPA2 Hands-on

Now that we have covered the theory, let's get our hands dirty! In this section we use the H2O-SCAN0 dataset to train DPA-2 both from scratch and by fine-tuning.

Note: the dataset used in this tutorial comes from AIS-Square; if you need more models or data, go explore it~

3.1 Overview
[15]
cd /root/dpa2
/root/dpa2
[17]
!tree -L 2
.
├── data
│   └── H2O-scan0
├── finetune
│   └── input.json
├── from_scratch
│   └── input.json
└── pretrain_model
    └── model.pt

5 directories, 3 files

Let's look at what the tutorial directory contains:

data: the H2O-SCAN0 dataset

finetune: the working directory for fine-tuning from the pretrained model; input.json is the input file

from_scratch: the working directory for training DPA-2 from scratch; input.json is the input file

pretrain_model: the directory holding the pretrained model, model.pt

3.2 Preparing the Input Script

In this example, the input files for from-scratch training and fine-tuning are identical:
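One quick way to confirm that two DeePMD input scripts really match is to compare them as parsed JSON rather than as raw text, which ignores whitespace differences. The helper below is our own sketch, not part of DeePMD-kit:

```python
import json

def same_config(path_a, path_b):
    """True if the two JSON input scripts parse to the same configuration."""
    with open(path_a) as fa, open(path_b) as fb:
        return json.load(fa) == json.load(fb)

# e.g. same_config("from_scratch/input.json", "finetune/input.json")
```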
[18]
cat from_scratch/input.json
{
 "_comment": "that's all",
 "model": {
  "type_embedding": {
   "neuron": [
    8
   ],
   "tebd_input_mode": "concat"
  },
  "type_map": [
   "H",
   "He",
   "Li",
   "Be",
   "B",
   "C",
   "N",
   "O",
   "F",
   "Ne",
   "Na",
   "Mg",
   "Al",
   "Si",
   "P",
   "S",
   "Cl",
   "Ar",
   "K",
   "Ca",
   "Sc",
   "Ti",
   "V",
   "Cr",
   "Mn",
   "Fe",
   "Co",
   "Ni",
   "Cu",
   "Zn",
   "Ga",
   "Ge",
   "As",
   "Se",
   "Br",
   "Kr",
   "Rb",
   "Sr",
   "Y",
   "Zr",
   "Nb",
   "Mo",
   "Tc",
   "Ru",
   "Rh",
   "Pd",
   "Ag",
   "Cd",
   "In",
   "Sn",
   "Sb",
   "Te",
   "I",
   "Xe",
   "Cs",
   "Ba",
   "La",
   "Ce",
   "Pr",
   "Nd",
   "Pm",
   "Sm",
   "Eu",
   "Gd",
   "Tb",
   "Dy",
   "Ho",
   "Er",
   "Tm",
   "Yb",
   "Lu",
   "Hf",
   "Ta",
   "W",
   "Re",
   "Os",
   "Ir",
   "Pt",
   "Au",
   "Hg",
   "Tl",
   "Pb",
   "Bi",
   "Po",
   "At",
   "Rn",
   "Fr",
   "Ra",
   "Ac",
   "Th",
   "Pa",
   "U",
   "Np",
   "Pu",
   "Am",
   "Cm",
   "Bk",
   "Cf",
   "Es",
   "Fm",
   "Md",
   "No",
   "Lr",
   "Rf",
   "Db",
   "Sg",
   "Bh",
   "Hs",
   "Mt",
   "Ds",
   "Rg",
   "Cn",
   "Nh",
   "Fl",
   "Mc",
   "Lv",
   "Ts",
   "Og"
  ],
  "descriptor": {
   "type": "hybrid",
   "hybrid_mode": "sequential",
   "list": [
    {
     "type": "se_atten",
     "sel": 120,
     "rcut_smth": 8.0,
     "rcut": 9.0,
     "neuron": [
      25,
      50,
      100
     ],
     "resnet_dt": false,
     "axis_neuron": 12,
     "seed": 1,
     "attn": 128,
     "attn_layer": 0,
     "attn_dotr": true,
     "attn_mask": false,
     "post_ln": true,
     "ffn": false,
     "ffn_embed_dim": 1024,
     "activation": "tanh",
     "scaling_factor": 1.0,
     "head_num": 1,
     "normalize": true,
     "temperature": 1.0,
     "add": "concat",
     "pre_add": true,
     "_comment": " that's all"
    },
    {
     "type": "se_uni",
     "sel": 40,
     "rcut_smth": 3.5,
     "rcut": 4.0,
     "nlayers": 12,
     "g1_dim": 128,
     "g2_dim": 32,
     "attn2_hidden": 32,
     "attn2_nhead": 4,
     "attn1_hidden": 128,
     "attn1_nhead": 4,
     "axis_dim": 4,
     "update_h2": false,
     "update_g1_has_conv": true,
     "update_g1_has_grrg": true,
     "update_g1_has_drrd": true,
     "update_g1_has_attn": true,
     "update_g2_has_g1g1": true,
     "update_g2_has_attn": true,
     "attn2_has_gate": true,
     "add_type_ebd_to_seq": false,
     "smooth": true,
     "_comment": " that's all"
    }
   ]
  },
  "fitting_net": {
   "neuron": [
    240,
    240,
    240
   ],
   "resnet_dt": true,
   "seed": 1,
   "_comment": " that's all"
  },
  "_comment": " that's all"
 },
 "learning_rate": {
  "type": "exp",
  "decay_steps": 1,
  "start_lr": 0.0002,
  "stop_lr": 3.51e-08,
  "_comment": "that's all"
 },
 "loss": {
  "type": "ener",
  "start_pref_e": 0.02,
  "limit_pref_e": 1,
  "start_pref_f": 1000,
  "limit_pref_f": 1,
  "start_pref_v": 0,
  "limit_pref_v": 0,
  "_comment": " that's all"
 },
 "training": {
  "training_data": {
   "systems": [
                "/root/dpa2/data/H2O-scan0/data1/train",
                "/root/dpa2/data/H2O-scan0/data10/train",
                "/root/dpa2/data/H2O-scan0/data11/train",
                "/root/dpa2/data/H2O-scan0/data12/train",
                "/root/dpa2/data/H2O-scan0/data13/train",
                "/root/dpa2/data/H2O-scan0/data14/train",
                "/root/dpa2/data/H2O-scan0/data15/train",
                "/root/dpa2/data/H2O-scan0/data16/train",
                "/root/dpa2/data/H2O-scan0/data17/train",
                "/root/dpa2/data/H2O-scan0/data18/train",
                "/root/dpa2/data/H2O-scan0/data19/train",
                "/root/dpa2/data/H2O-scan0/data2/train",
                "/root/dpa2/data/H2O-scan0/data20/train",
                "/root/dpa2/data/H2O-scan0/data21/train",
                "/root/dpa2/data/H2O-scan0/data22/train",
                "/root/dpa2/data/H2O-scan0/data23/train",
                "/root/dpa2/data/H2O-scan0/data24/train",
                "/root/dpa2/data/H2O-scan0/data25/train",
                "/root/dpa2/data/H2O-scan0/data26/train",
                "/root/dpa2/data/H2O-scan0/data27/train",
                "/root/dpa2/data/H2O-scan0/data28/train",
                "/root/dpa2/data/H2O-scan0/data29/train",
                "/root/dpa2/data/H2O-scan0/data3/train",
                "/root/dpa2/data/H2O-scan0/data30/train",
                "/root/dpa2/data/H2O-scan0/data31/train",
                "/root/dpa2/data/H2O-scan0/data32/train",
                "/root/dpa2/data/H2O-scan0/data33/train",
                "/root/dpa2/data/H2O-scan0/data34/train",
                "/root/dpa2/data/H2O-scan0/data35/train",
                "/root/dpa2/data/H2O-scan0/data36/train",
                "/root/dpa2/data/H2O-scan0/data37/train",
                "/root/dpa2/data/H2O-scan0/data38/train",
                "/root/dpa2/data/H2O-scan0/data39/train",
                "/root/dpa2/data/H2O-scan0/data4/train",
                "/root/dpa2/data/H2O-scan0/data40/train",
                "/root/dpa2/data/H2O-scan0/data41/train",
                "/root/dpa2/data/H2O-scan0/data42/train",
                "/root/dpa2/data/H2O-scan0/data43/train",
                "/root/dpa2/data/H2O-scan0/data44/train",
                "/root/dpa2/data/H2O-scan0/data45/train",
                "/root/dpa2/data/H2O-scan0/data5/train",
                "/root/dpa2/data/H2O-scan0/data6/train",
                "/root/dpa2/data/H2O-scan0/data7/train",
                "/root/dpa2/data/H2O-scan0/data8/train",
                "/root/dpa2/data/H2O-scan0/data9/train",
                "/root/dpa2/data/H2O-scan0/data_ex1/train",
                "/root/dpa2/data/H2O-scan0/data_ex2/train",
                "/root/dpa2/data/H2O-scan0/data_ex3/train",
                "/root/dpa2/data/H2O-scan0/data_ex4/train",
                "/root/dpa2/data/H2O-scan0/data_ex5/train",
                "/root/dpa2/data/H2O-scan0/data_ex6/train"
   ],
   "batch_size": "auto",
   "_comment": "that's all"
  },
  "validation_data": {
   "systems": [
                "/root/dpa2/data/H2O-scan0/data1/valid",
                "/root/dpa2/data/H2O-scan0/data10/valid",
                "/root/dpa2/data/H2O-scan0/data11/valid",
                "/root/dpa2/data/H2O-scan0/data12/valid",
                "/root/dpa2/data/H2O-scan0/data13/valid",
                "/root/dpa2/data/H2O-scan0/data14/valid",
                "/root/dpa2/data/H2O-scan0/data15/valid",
                "/root/dpa2/data/H2O-scan0/data16/valid",
                "/root/dpa2/data/H2O-scan0/data17/valid",
                "/root/dpa2/data/H2O-scan0/data18/valid",
                "/root/dpa2/data/H2O-scan0/data19/valid",
                "/root/dpa2/data/H2O-scan0/data2/valid",
                "/root/dpa2/data/H2O-scan0/data20/valid",
                "/root/dpa2/data/H2O-scan0/data21/valid",
                "/root/dpa2/data/H2O-scan0/data22/valid",
                "/root/dpa2/data/H2O-scan0/data23/valid",
                "/root/dpa2/data/H2O-scan0/data24/valid",
                "/root/dpa2/data/H2O-scan0/data25/valid",
                "/root/dpa2/data/H2O-scan0/data26/valid",
                "/root/dpa2/data/H2O-scan0/data27/valid",
                "/root/dpa2/data/H2O-scan0/data28/valid",
                "/root/dpa2/data/H2O-scan0/data29/valid",
                "/root/dpa2/data/H2O-scan0/data3/valid",
                "/root/dpa2/data/H2O-scan0/data30/valid",
                "/root/dpa2/data/H2O-scan0/data31/valid",
                "/root/dpa2/data/H2O-scan0/data32/valid",
                "/root/dpa2/data/H2O-scan0/data33/valid",
                "/root/dpa2/data/H2O-scan0/data34/valid",
                "/root/dpa2/data/H2O-scan0/data35/valid",
                "/root/dpa2/data/H2O-scan0/data36/valid",
                "/root/dpa2/data/H2O-scan0/data37/valid",
                "/root/dpa2/data/H2O-scan0/data38/valid",
                "/root/dpa2/data/H2O-scan0/data39/valid",
                "/root/dpa2/data/H2O-scan0/data4/valid",
                "/root/dpa2/data/H2O-scan0/data40/valid",
                "/root/dpa2/data/H2O-scan0/data41/valid",
                "/root/dpa2/data/H2O-scan0/data42/valid",
                "/root/dpa2/data/H2O-scan0/data43/valid",
                "/root/dpa2/data/H2O-scan0/data44/valid",
                "/root/dpa2/data/H2O-scan0/data45/valid",
                "/root/dpa2/data/H2O-scan0/data5/valid",
                "/root/dpa2/data/H2O-scan0/data6/valid",
                "/root/dpa2/data/H2O-scan0/data7/valid",
                "/root/dpa2/data/H2O-scan0/data8/valid",
                "/root/dpa2/data/H2O-scan0/data9/valid",
                "/root/dpa2/data/H2O-scan0/data_ex1/valid",
                "/root/dpa2/data/H2O-scan0/data_ex2/valid",
                "/root/dpa2/data/H2O-scan0/data_ex3/valid",
                "/root/dpa2/data/H2O-scan0/data_ex4/valid",
                "/root/dpa2/data/H2O-scan0/data_ex5/valid",
                "/root/dpa2/data/H2O-scan0/data_ex6/valid"
   ],
   "batch_size": 1,
   "_comment": "that's all"
  },
  "numb_steps": 200,
  "warmup_steps": 0,
  "gradient_max_norm": 5.0,
  "seed": 10,
  "disp_file": "lcurve.out",
  "disp_freq": 25,
  "save_freq": 200,
  "_comment": "that's all",
  "wandb_config": {
   "wandb_enabled": false,
   "entity": "dp_model_engineering",
   "project": "DPA"
  }
 }
}

Compared with DPA-1, the parameters that differ in DPA-2 are concentrated in the descriptor section:

"descriptor": {
   "type": "hybrid",
   "hybrid_mode": "sequential",
   "list": [
    {
     "type": "se_atten",
     "sel": 120,
     "rcut_smth": 8.0,
     "rcut": 9.0,
     "neuron": [
      25,
      50,
      100
     ],
     "resnet_dt": false,
     "axis_neuron": 12,
     "seed": 1,
     "attn": 128,
     "attn_layer": 0,
     "attn_dotr": true,
     "attn_mask": false,
     "post_ln": true,
     "ffn": false,
     "ffn_embed_dim": 1024,
     "activation": "tanh",
     "scaling_factor": 1.0,
     "head_num": 1,
     "normalize": true,
     "temperature": 1.0,
     "add": "concat",
     "pre_add": true,
     "_comment": " that's all"
    },
    {
     "type": "se_uni",
     "sel": 40,
     "rcut_smth": 3.5,
     "rcut": 4.0,
     "nlayers": 12,
     "g1_dim": 128,
     "g2_dim": 32,
     "attn2_hidden": 32,
     "attn2_nhead": 4,
     "attn1_hidden": 128,
     "attn1_nhead": 4,
     "axis_dim": 4,
     "update_h2": false,
     "update_g1_has_conv": true,
     "update_g1_has_grrg": true,
     "update_g1_has_drrd": true,
     "update_g1_has_attn": true,
     "update_g2_has_g1g1": true,
     "update_g2_has_attn": true,
     "attn2_has_gate": true,
     "add_type_ebd_to_seq": false,
     "smooth": true,
     "_comment": " that's all"
    }
   ]
  },

Compared with the previously common se_e2_a descriptor and DPA-1's se_atten descriptor, DPA-2 adopts a hybrid of the se_atten and se_uni descriptors, connected sequentially: the output of the se_atten descriptor serves as the input of the se_uni descriptor. See Figure 2 for the detailed model construction.
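The "sequential" chaining can be illustrated with toy numpy stand-ins for the two descriptors. The shapes and operations here are made up for illustration; the real se_atten/se_uni operations are far more involved:

```python
import numpy as np

rng = np.random.default_rng(1)

def se_atten_like(coords):
    # Stage-1 stand-in (wide cutoff, rcut = 9.0 in the input): per-atom features.
    return np.tanh(coords @ rng.normal(size=(3, 100)))

def se_uni_like(features):
    # Stage-2 stand-in (short cutoff, rcut = 4.0): refines the stage-1 features.
    return np.tanh(features @ rng.normal(size=(100, 128)))

coords = rng.normal(size=(12, 3))                # 12 atoms, 3D positions
descriptor = se_uni_like(se_atten_like(coords))  # sequential: stage-1 output -> stage-2 input
print(descriptor.shape)                          # (12, 128)
```

This is the essence of "hybrid_mode": "sequential" in the input script: the two list entries are not concatenated side by side but composed, first-to-second.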

3.3 Model Training (from scratch)
[19]
cd from_scratch
/root/dpa2/from_scratch
[20]
!dp_pt train input.json
2023-12-01 11:12:25.880822: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-01 11:12:25.880908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-01 11:12:25.984780: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-01 11:12:26.194761: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-01 11:12:27.714354: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /opt/mamba/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-12-01 11:12:30,608  [main.py:170] INFO DeepMD version: 0.1.3.dev254+g24d5796
2023-12-01 11:12:30,610  [main.py:133] INFO Configuration path: input.json
2023-12-01 11:12:30,843  [stat.py:37] INFO Packing data for statistics from 51 systems
100%|███████████████████████████████████████████| 51/51 [00:10<00:00,  4.83it/s]
2023-12-01 11:12:41,411  [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05427021 0.01356755 0.01356755 0.01356755 0.0271351  0.0271351
 0.01356755 0.04027421 0.03484719 0.01242502 0.01356755 0.0271351
 0.02442159 0.02142245 0.02113682 0.01756641 0.013996   0.00671237
 0.00585547 0.0087118  0.0061411  0.00542702 0.01856612 0.020994
 0.01628106 0.01685233 0.01799486 0.01628106 0.01085404 0.01985147
 0.0165667  0.0157098  0.01899457 0.01356755 0.01628106 0.01628106
 0.0524136  0.03227649 0.04070266 0.02913453 0.01342474 0.01328192
 0.01356755 0.01356755 0.01356755 0.05384176 0.01356755 0.01356755
 0.01356755 0.01342474 0.01256784]
2023-12-01 11:12:41,412  [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05763689 0.01440922 0.01440922 0.01440922 0.02881844 0.02881844
 0.01440922 0.04034582 0.03458213 0.01152738 0.01440922 0.02881844
 0.0259366  0.02017291 0.02017291 0.01729107 0.01440922 0.00576369
 0.00576369 0.00864553 0.00576369 0.00288184 0.01729107 0.02017291
 0.01729107 0.01729107 0.01729107 0.01440922 0.01152738 0.02017291
 0.01729107 0.01440922 0.01729107 0.01440922 0.01440922 0.01729107
 0.05475504 0.03170029 0.04034582 0.02881844 0.01152738 0.01152738
 0.01440922 0.01440922 0.01440922 0.05475504 0.01152738 0.01440922
 0.01440922 0.01152738 0.01152738]
2023-12-01 11:12:47,991  [model.py:53] INFO Saving stat file to stat_files/stat_file_rcut9.00_smth8.00_sel120_se_atten.npz
2023-12-01 11:12:47,994  [model.py:53] INFO Saving stat file to stat_files/stat_file_rcut4.00_smth3.50_sel40_se_uni.npz
2023-12-01 11:12:48,019  [ener.py:45] INFO Set seed to 1 in fitting net.
2023-12-01 11:12:48,055  [training.py:359] INFO Start to train 200 steps.
2023-12-01 11:12:51,775  [training.py:510] INFO step=0, lr=2.00e-04             
loss=4719.2101, rmse_train=68.6965, rmse_e_train=0.9473, rmse_f_train=2.1716, rmse_valid=59.3087, rmse_e_valid=0.7322, rmse_f_valid=1.8750, speed=3.72 s/1 batches
2023-12-01 11:12:59,114  [training.py:510] INFO step=25, lr=6.79e-05            
loss=1148.5740, rmse_train=33.8906, rmse_e_train=0.2380, rmse_f_train=1.8324, rmse_valid=49.8568, rmse_e_valid=0.2527, rmse_f_valid=2.6997, speed=7.34 s/25 batches
2023-12-01 11:13:06,258  [training.py:510] INFO step=50, lr=2.30e-05            
loss=149.6035, rmse_train=12.2312, rmse_e_train=0.0317, rmse_f_train=1.1351, rmse_valid=13.1363, rmse_e_valid=0.0566, rmse_f_valid=1.2178, speed=7.14 s/25 batches
2023-12-01 11:13:13,421  [training.py:510] INFO step=75, lr=7.81e-06            
loss=25.6861, rmse_train=5.0681, rmse_e_train=0.0285, rmse_f_train=0.7989, rmse_valid=5.4680, rmse_e_valid=0.0385, rmse_f_valid=0.8605, speed=7.16 s/25 batches
2023-12-01 11:13:20,500  [training.py:510] INFO step=100, lr=2.65e-06           
loss=10.3940, rmse_train=3.2240, rmse_e_train=0.0164, rmse_f_train=0.8524, rmse_valid=3.9693, rmse_e_valid=0.0351, rmse_f_valid=1.0442, speed=7.08 s/25 batches
2023-12-01 11:13:28,028  [training.py:510] INFO step=125, lr=8.99e-07           
loss=4.2112, rmse_train=2.0521, rmse_e_train=0.0061, rmse_f_train=0.8751, rmse_valid=2.0618, rmse_e_valid=0.0061, rmse_f_valid=0.8792, speed=7.53 s/25 batches
2023-12-01 11:13:35,140  [training.py:510] INFO step=150, lr=3.05e-07           
loss=1.7295, rmse_train=1.3151, rmse_e_train=0.0168, rmse_f_train=0.8149, rmse_valid=1.5815, rmse_e_valid=0.0261, rmse_f_valid=0.9694, speed=7.11 s/25 batches
2023-12-01 11:13:42,314  [training.py:510] INFO step=175, lr=1.03e-07           
loss=1.3723, rmse_train=1.1715, rmse_e_train=0.0281, rmse_f_train=0.8972, rmse_valid=1.2020, rmse_e_valid=0.0001, rmse_f_valid=0.9760, speed=7.17 s/25 batches
2023-12-01 11:13:49,239  [training.py:529] INFO Saved model to model_200.pt     
100%|█████████████████████████████████████████| 200/200 [01:01<00:00,  3.27it/s]
2023-12-01 11:13:49,245  [training.py:562] INFO Trained model has been saved to: model.pt

3.4 Model Fine-tuning
[21]
cd ../finetune
/root/dpa2/finetune

The fine-tuning command adds the option --finetune ../pretrain_model/model.pt, meaning training starts from the model ../pretrain_model/model.pt and fine-tunes it.

It also adds the option -m H2O_H2O-PD: our dataset, H2O-SCAN0, is close to the pretraining dataset H2O_H2O-PD, so we initialize the fitting net of the fine-tuned model with the parameters of the pretrained H2O_H2O-PD fitting net.
[22]
!dp_pt train input.json --finetune ../pretrain_model/model.pt -m H2O_H2O-PD
2023-12-01 11:14:28.921978: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-01 11:14:28.922049: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-01 11:14:28.923370: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-01 11:14:28.931242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-01 11:14:29.940467: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /opt/mamba/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-12-01 11:14:31,822  [main.py:170] INFO DeepMD version: 0.1.3.dev254+g24d5796
2023-12-01 11:14:31,824  [main.py:133] INFO Configuration path: input.json
2023-12-01 11:14:36,380  [finetune.py:55] INFO Change the model configurations according to the model branch H2O_H2O-PD in the pretrained one...
2023-12-01 11:14:36,580  [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05427021 0.01356755 0.01356755 0.01356755 0.0271351  0.0271351
 0.01356755 0.04027421 0.03484719 0.01242502 0.01356755 0.0271351
 0.02442159 0.02142245 0.02113682 0.01756641 0.013996   0.00671237
 0.00585547 0.0087118  0.0061411  0.00542702 0.01856612 0.020994
 0.01628106 0.01685233 0.01799486 0.01628106 0.01085404 0.01985147
 0.0165667  0.0157098  0.01899457 0.01356755 0.01628106 0.01628106
 0.0524136  0.03227649 0.04070266 0.02913453 0.01342474 0.01328192
 0.01356755 0.01356755 0.01356755 0.05384176 0.01356755 0.01356755
 0.01356755 0.01342474 0.01256784]
2023-12-01 11:14:36,580  [dataloader.py:279] INFO Generated weighted sampler with prob array: [0.05763689 0.01440922 0.01440922 0.01440922 0.02881844 0.02881844
 0.01440922 0.04034582 0.03458213 0.01152738 0.01440922 0.02881844
 0.0259366  0.02017291 0.02017291 0.01729107 0.01440922 0.00576369
 0.00576369 0.00864553 0.00576369 0.00288184 0.01729107 0.02017291
 0.01729107 0.01729107 0.01729107 0.01440922 0.01152738 0.02017291
 0.01729107 0.01440922 0.01729107 0.01440922 0.01440922 0.01729107
 0.05475504 0.03170029 0.04034582 0.02881844 0.01152738 0.01152738
 0.01440922 0.01440922 0.01440922 0.05475504 0.01152738 0.01440922
 0.01440922 0.01152738 0.01152738]
2023-12-01 11:14:37,717  [ener.py:45] INFO Set seed to 1 in fitting net.
2023-12-01 11:14:37,740  [training.py:238] INFO Resuming from ../pretrain_model/model.pt.
2023-12-01 11:14:38,291  [fitting.py:101] INFO Changing energy bias in pretrained model for types ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og']... (this step may take long time)
2023-12-01 11:14:38,424  [stat.py:37] INFO Packing data for statistics from 51 systems
100%|███████████████████████████████████████████| 51/51 [00:02<00:00, 17.08it/s]
2023-12-01 11:14:48,994  [fitting.py:166] INFO RMSE of atomic energy after linear regression is: 5.49232e-04 eV/atom.
2023-12-01 11:14:48,996  [fitting.py:178] INFO Change energy bias of ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og'] from [-6.338534  0.        0.        0.        0.        0.        0.
 -3.169267  0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.      ] to [-187.04663     0.          0.          0.          0.          0.
    0.        -93.523315    0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.      ].
2023-12-01 11:14:49,010  [training.py:359] INFO Start to train 200 steps.
2023-12-01 11:14:49,735  [training.py:510] INFO step=0, lr=2.00e-04             
loss=4.6365, rmse_train=2.1533, rmse_e_train=0.0002, rmse_f_train=0.0681, rmse_valid=34.4455, rmse_e_valid=0.2252, rmse_f_valid=1.0892, speed=0.73 s/1 batches
2023-12-01 11:14:57,273  [training.py:510] INFO step=25, lr=6.79e-05            
loss=12.2519, rmse_train=3.5003, rmse_e_train=0.0112, rmse_f_train=0.1897, rmse_valid=2.8736, rmse_e_valid=0.0033, rmse_f_valid=0.1558, speed=7.54 s/25 batches
2023-12-01 11:15:04,581  [training.py:510] INFO step=50, lr=2.30e-05            
loss=1.9723, rmse_train=1.4044, rmse_e_train=0.0195, rmse_f_train=0.1282, rmse_valid=1.1252, rmse_e_valid=0.0116, rmse_f_valid=0.1035, speed=7.31 s/25 batches
2023-12-01 11:15:11,969  [training.py:510] INFO step=75, lr=7.81e-06            
loss=0.3784, rmse_train=0.6151, rmse_e_train=0.0019, rmse_f_train=0.0972, rmse_valid=0.5964, rmse_e_valid=0.0039, rmse_f_valid=0.0939, speed=7.39 s/25 batches
2023-12-01 11:15:19,280  [training.py:510] INFO step=100, lr=2.65e-06           
loss=0.1231, rmse_train=0.3508, rmse_e_train=0.0062, rmse_f_train=0.0902, rmse_valid=0.3211, rmse_e_valid=0.0017, rmse_f_valid=0.0849, speed=7.31 s/25 batches
2023-12-01 11:15:26,657  [training.py:510] INFO step=125, lr=8.99e-07           
loss=0.0388, rmse_train=0.1970, rmse_e_train=0.0023, rmse_f_train=0.0830, rmse_valid=0.2276, rmse_e_valid=0.0014, rmse_f_valid=0.0968, speed=7.38 s/25 batches
2023-12-01 11:15:33,982  [training.py:510] INFO step=150, lr=3.05e-07           
loss=0.0208, rmse_train=0.1442, rmse_e_train=0.0002, rmse_f_train=0.0907, rmse_valid=0.1755, rmse_e_valid=0.0020, rmse_f_valid=0.1092, speed=7.32 s/25 batches
2023-12-01 11:15:41,362  [training.py:510] INFO step=175, lr=1.03e-07           
loss=0.0130, rmse_train=0.1139, rmse_e_train=0.0032, rmse_f_train=0.0853, rmse_valid=0.1112, rmse_e_valid=0.0014, rmse_f_valid=0.0889, speed=7.38 s/25 batches
2023-12-01 11:15:48,539  [training.py:529] INFO Saved model to model_200.pt     
100%|█████████████████████████████████████████| 200/200 [00:59<00:00,  3.36it/s]
2023-12-01 11:15:48,545  [training.py:562] INFO Trained model has been saved to: model.pt

3.5 Model Validation

We have now trained the DPA-2 potential both from scratch and by fine-tuning. Let's inspect the models' performance through lcurve.out (the learning-curve output file that logs training and validation errors over the run)!

📌 Note: to save time, this tutorial trains for only 200 steps, which does not reach convergence; in a real training run, remember to set an appropriate number of steps~
[23]
cd /root/dpa2
/root/dpa2

Want a direct visual comparison of the learning curves of from-scratch training and fine-tuning? Run the prepared visualization script~
[24]
import numpy as np
import matplotlib.pyplot as plt

# Plot matching energy/force RMSE columns from both runs side by side.
fig = plt.figure(figsize=(16, 4))
data_scratch = np.genfromtxt("from_scratch/lcurve.out", names=True)
data_finetune = np.genfromtxt("finetune/lcurve.out", names=True)
for idx, ii in enumerate([1, 3, 5]):
    plt.subplot(1, 3, idx + 1)
    for name in data_scratch.dtype.names[ii:ii + 1]:
        plt.plot(data_scratch["step"], data_scratch[name], label=f"scratch_{name}")
    for name in data_finetune.dtype.names[ii:ii + 1]:
        plt.plot(data_finetune["step"], data_finetune[name], label=f"finetune_{name}")
    plt.legend()
    plt.xlabel("Step")
    plt.ylabel("Loss")
    # plt.xscale("symlog")
    plt.yscale("log")
    plt.grid()
plt.show()


Comparing the learning curves, we can see that fine-tuning from the pretrained model achieves lower energy and force losses.
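To put numbers on that comparison, you can read off the metrics at the last recorded step of each lcurve.out. The helper below is our own addition, assuming the standard whitespace-separated lcurve format with a `#`-commented header line:

```python
import numpy as np

def final_rmse(lcurve_path):
    """Return the metrics recorded at the last training step of an lcurve.out file."""
    data = np.genfromtxt(lcurve_path, names=True)  # header names come from the '#' line
    last = data[-1]
    return {name: float(last[name]) for name in data.dtype.names}

# e.g. compare final_rmse("from_scratch/lcurve.out") with final_rmse("finetune/lcurve.out")
```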

4. References

DPA-2:

  1. DPA-2 paper in preparation

DPA-1:

  1. DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation
  2. DeePMD-kit's documentation
  3. Quick Start with the DPA-1 Pretrained Deep Potential Model
  4. DPA-1: Building a Pretrained Large Model Covering the Periodic Table
  5. Duo Zhang: Introduction to the DPA-1 Pretrained Model & Hands-on Practice

Recommended DPMD-series notebooks:

  1. Columbus Bootcamp | DPA-1: Solid-State Electrolytes in Practice — Model Training & Property Calculation
  2. Quick Start DeePMD-kit | Training a Deep Potential Molecular Dynamics Model for Methane
  3. From DFT to MD | A Detailed Beginner's Guide to Deep Potential Materials Computation
  4. Hands-on Solid-State Electrolyte Research with Deep Potential Molecular Dynamics

Bibliography

  1. Han Wang, Linfeng Zhang, Jiequn Han, and Weinan E. DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Comput. Phys. Comm., 228:178–184, 2018. doi:10.1016/j.cpc.2018.03.016.
  2. Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, and Han Wang. DeePMD-kit v2: A software package for Deep Potential models. 2023. doi:10.48550/arXiv.2304.09409.
  3. Huang J, Zhang L, Wang H, Zhao J, Cheng J, E W. Deep potential generation scheme and simulation protocol for the Li10GeP2S12-type superionic conductors. J Chem Phys. 2021;154(9):094703. doi:10.1063/5.0041849
  4. https://docs.deepmodeling.com/projects/deepmd/en/master/index.html
  5. https://github.com/deepmodeling/deepmd-kit