©️ Copyright 2024 @ Authors
Author: John Yan (intern, AISI Electronic Structure Team) 📨
Date: 2024-09-26
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Start Connection button at the top, select the deepks-abacus:3.7.5v1 image and the c2_m4_cpu node configuration, then after connecting click Switch Kernel in the upper right, choose the Python (ipykernel) kernel, and wait a moment before running.
Contents:
1. Introduction to DeePKS
2. The DeePKS workflow in practice
3. Input files
4. Output files
5. Running a complete DeePKS calculation
Appendix: DeePKS tutorial resources
0. Before you start
This is, in principle, the first tutorial in the series that newcomers should read. Using energy-label training on a single-composition system (all structures share the same elemental composition) as the example, it explains the meaning and role of the various DeePKS parameters and input files and walks through the complete DeePKS training workflow.
Building on this tutorial, users can continue with the second and third tutorials, which cover multi-label training and multi-label training for systems of varying elemental composition, respectively. For producing training data, we also provide a companion tutorial on generating high-accuracy labels with DeePKS and ABACUS together.
1. Introduction to DeePKS
The accuracy of a density functional theory (DFT) calculation is tied to the accuracy of the exchange-correlation (XC) functional approximation. Commonly used functionals such as PBE are efficient but not very accurate, while hybrid functionals such as HSE06 are accurate but expensive. DeePKS (Deep Kohn-Sham) was proposed to address this trade-off: in plain terms, it introduces a deep-learning model to bridge the gap between a low-accuracy and a high-accuracy functional, improving overall accuracy while keeping the efficiency of the cheaper functional.
1.1 What is a DeePKS model?
DeePKS broadly refers to a workflow that couples machine learning with DFT calculations. Its final product is a model.ptg model file; loading this model in an ABACUS calculation can substantially improve accuracy without significantly increasing the time cost.
1.2 What can DeePKS do?
The key feature of DeePKS is that it substantially improves accuracy at little extra computational cost. Taking into account the compute cost of training DeePKS itself, its two main applications at present are:
1. producing high-accuracy training data for DeePMD (the number of atoms and the elements in the system stay fixed);
2. large-scale production of high-accuracy data for a specific class of materials (both the number of atoms and the elements may vary).
1.3 How is DeePKS used?
For the first application, producing high-accuracy DeePMD training data, the notebooks 《从 DFT 先去 DeePKS 再到 DeePMD | DeePKS基础篇》 and 《从 DFT 先去 DeePKS 再到 DeePMD | DeePKS案例篇 + 增强采样》 are available for reference. This tutorial focuses on the second application.
2. The DeePKS workflow in practice
DeePKS couples machine learning with DFT, so its workflow involves both. The main line follows a conventional machine-learning process: model training (dataset preparation plus training) and model validation/deployment (running calculations with the trained model). The difference is that in DeePKS the training and deployment steps are nested inside an outer loop; repeating this loop several times progressively reduces the model error.
2.1 Model initialization
As with any machine-learning model, the first step is to prepare the training data. The data needed for DeePKS training consist of: the structure files of the training configurations, the high-accuracy reference labels for those configurations, and the descriptors and energies produced by low-accuracy calculations on them.
2.1.1 Structure files
These are the structure files of the system under study. For example, if 1000 perovskite structures need high-accuracy calculations, we might select 100 of them as DeePKS training data. The structures of those 100 configurations are first saved as atom.npy and box.npy: the former stores the atomic numbers and coordinates, the latter the cell parameters. Example printouts are shown below, followed by a short NumPy sketch of how such files can be assembled.
原子坐标: [[[ 5.50000e+01 4.45721e+00 8.47117e+00 3.18766e+00] [ 5.50000e+01 8.88199e+00 4.27717e+00 3.19329e+00] [ 8.20000e+01 -1.10200e-02 8.39445e+00 2.18000e-03] [ 8.20000e+01 4.50257e+00 4.34769e+00 6.38983e+00] [ 5.30000e+01 3.37900e-02 8.47264e+00 3.19620e+00] [ 5.30000e+01 4.45736e+00 4.28743e+00 3.19408e+00] [ 5.30000e+01 7.32069e+00 2.82080e+00 6.38725e+00] [ 5.30000e+01 1.69104e+00 5.88711e+00 6.38902e+00] [ 5.30000e+01 6.10747e+00 7.01671e+00 6.38590e+00] [ 5.30000e+01 2.90397e+00 1.68373e+00 6.38465e+00]] [[ 5.50000e+01 4.20581e+00 3.22000e-03 3.16211e+00] [ 5.50000e+01 5.93000e-02 4.42173e+00 3.15208e+00] [ 8.20000e+01 1.12980e-01 8.84586e+00 6.31076e+00] [ 8.20000e+01 4.22028e+00 4.42597e+00 6.50000e-04] [ 5.30000e+01 9.33200e-02 8.84681e+00 3.16205e+00] [ 5.30000e+01 4.24645e+00 4.42289e+00 3.15813e+00] [ 5.30000e+01 6.91572e+00 2.80779e+00 6.31441e+00] [ 5.30000e+01 1.57360e+00 6.03491e+00 4.00000e-05] [ 5.30000e+01 5.80973e+00 7.23728e+00 6.31701e+00] [ 5.30000e+01 2.67860e+00 1.60535e+00 2.02000e-03]] [[ 5.50000e+01 4.18975e+00 -1.98300e-02 3.19716e+00] [ 5.50000e+01 1.23810e-01 4.21084e+00 3.20264e+00] [ 8.20000e+01 2.57650e-01 8.45240e+00 7.93000e-03] [ 8.20000e+01 4.33213e+00 4.22611e+00 2.45000e-03] [ 5.30000e+01 8.40151e+00 -2.04600e-02 3.20239e+00] [ 5.30000e+01 4.31700e+00 4.19584e+00 3.21319e+00] [ 5.30000e+01 6.97306e+00 2.68284e+00 1.13600e-02] [ 5.30000e+01 1.70337e+00 5.71844e+00 6.39885e+00] [ 5.30000e+01 5.93409e+00 6.88047e+00 6.40369e+00] [ 5.30000e+01 2.72114e+00 1.50811e+00 6.39037e+00]] [[ 5.50000e+01 4.35272e+00 6.36200e-02 3.21811e+00] [ 5.50000e+01 1.30500e-01 4.37057e+00 3.21878e+00] [ 8.20000e+01 9.83300e-02 8.62484e+00 7.22000e-03] [ 8.20000e+01 4.48714e+00 4.43211e+00 6.43466e+00] [ 5.30000e+01 1.77660e-01 8.68428e+00 3.21205e+00] [ 5.30000e+01 4.40284e+00 4.37382e+00 3.21493e+00] [ 5.30000e+01 7.19056e+00 2.85967e+00 6.42701e+00] [ 5.30000e+01 1.78389e+00 6.00818e+00 6.43103e+00] [ 5.30000e+01 6.06689e+00 7.17788e+00 6.43279e+00] [ 5.30000e+01 2.73432e+00 1.56879e+00 8.00000e-04]] [[ 5.50000e+01 4.46457e+00 8.40152e+00 3.08672e+00] [ 5.50000e+01 3.06800e-02 4.16050e+00 3.08518e+00] [ 8.20000e+01 5.80200e-02 8.47952e+00 5.13000e-03] [ 8.20000e+01 4.44641e+00 4.24032e+00 2.73000e-03] [ 5.30000e+01 6.57000e-03 -7.77500e-02 3.07974e+00] [ 5.30000e+01 4.44725e+00 4.16390e+00 3.08751e+00] [ 5.30000e+01 7.24828e+00 2.69828e+00 1.72000e-03] [ 5.30000e+01 1.64657e+00 5.62620e+00 6.17415e+00] [ 5.30000e+01 6.07580e+00 6.77635e+00 6.17002e+00] [ 5.30000e+01 2.82429e+00 1.53701e+00 4.56000e-03]] [[ 5.50000e+01 4.13570e+00 8.81508e+00 3.15701e+00] [ 5.50000e+01 8.38743e+00 4.41301e+00 3.15073e+00] [ 8.20000e+01 1.49670e-01 3.49700e-02 6.30504e+00] [ 8.20000e+01 4.12313e+00 4.40225e+00 4.04000e-03] [ 5.30000e+01 7.94700e-02 2.42400e-02 3.15466e+00] [ 5.30000e+01 4.20968e+00 4.41496e+00 3.15226e+00] [ 5.30000e+01 6.96383e+00 2.83861e+00 6.30567e+00] [ 5.30000e+01 1.59341e+00 6.02874e+00 6.30411e+00] [ 5.30000e+01 5.76070e+00 7.23018e+00 6.30686e+00] [ 5.30000e+01 2.63660e+00 1.60195e+00 8.83000e-03]]]
晶胞参数: [[ 8.8449860e+00 0.0000000e+00 0.0000000e+00 -1.2236399e-02 8.3958130e+00 0.0000000e+00 8.8127881e-02 1.5597731e-01 6.3899417e+00] [ 8.3773832e+00 0.0000000e+00 0.0000000e+00 7.4931532e-02 8.8473816e+00 0.0000000e+00 3.5728581e-02 -6.9422799e-04 6.3173370e+00] [ 8.4084988e+00 0.0000000e+00 0.0000000e+00 2.5593123e-01 8.4543428e+00 0.0000000e+00 -9.3760164e-03 -4.2780988e-02 6.4083138e+00] [ 8.5417690e+00 0.0000000e+00 0.0000000e+00 9.2207491e-02 8.6320896e+00 0.0000000e+00 1.6781764e-01 1.1487858e-01 6.4359016e+00] [ 8.8380404e+00 0.0000000e+00 0.0000000e+00 5.2342005e-02 8.4803886e+00 0.0000000e+00 4.9020550e-03 -1.5652113e-01 6.1760850e+00] [ 8.3810892e+00 0.0000000e+00 0.0000000e+00 -1.2794954e-01 8.7994413e+00 0.0000000e+00 1.4874113e-01 3.2601543e-02 6.3107724e+00]]
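To make the expected layout concrete, here is a minimal NumPy sketch (with hypothetical coordinates and cells) of how atom.npy and box.npy could be assembled; the first entry of each atom row is the atomic number, matching the Cs (55), Pb (82), and I (53) values in the printout above.

```python
import numpy as np

# two hypothetical frames of a 2-atom system, coordinates in Angstrom
frames_species = [[55, 82], [55, 82]]                        # Cs, Pb
frames_coords = [
    [[0.00, 0.00, 0.00], [2.95, 2.95, 2.95]],
    [[0.01, 0.00, 0.02], [2.94, 2.96, 2.95]],
]
frames_cells = [np.eye(3) * 5.90, np.eye(3) * 5.91]          # 3x3 lattice vectors

# each atom row is [atomic number, x, y, z]; each box row is the 3x3 cell flattened to 9 numbers
atom = np.array([
    [[z] + list(xyz) for z, xyz in zip(spec, coords)]
    for spec, coords in zip(frames_species, frames_coords)
])                                                           # shape [nframes, natoms, 4]
box = np.array([cell.reshape(9) for cell in frames_cells])   # shape [nframes, 9]

np.save("atom.npy", atom)
np.save("box.npy", box)
print(atom.shape, box.shape)                                 # (2, 2, 4) (2, 9)
```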
2.1.2 High-accuracy reference labels
These are the results of the target high-accuracy calculations, e.g. energies (energy.npy) and forces (force.npy). Continuing the example above, we would run HSE06 calculations on the 100 configurations to obtain high-accuracy energies. In this tutorial we use label data that have already been computed, which can be downloaded directly from the AIS Square platform. A unit-conversion sketch follows the example values below.
HSE06 energy labels: [[-1.23483162] [-1.23362195] [-1.2287092 ] [-1.23558435] [-1.23065087] [-1.23260269]]
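If the high-accuracy labels come from VASP, note that VASP reports energies in eV while DeePKS-kit expects Hartree (see the table in Section 3.1). A simple conversion sketch, with a hypothetical input file of one energy per frame:

```python
import numpy as np

EV_TO_HARTREE = 1.0 / 27.211386245988          # 1 Hartree = 27.211386245988 eV

# hypothetical file: one HSE06 energy per frame, in eV
energies_ev = np.loadtxt("hse06_energies_eV.txt").reshape(-1, 1)
np.save("energy.npy", energies_ev * EV_TO_HARTREE)   # shape [nframes, 1], in Hartree
```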
2.1.3 Descriptors and energies from low-accuracy calculations (initialization)
This refers to running PBE calculations in ABACUS with the projector file jle.orb to obtain the descriptors and the low-accuracy energy of each configuration.
Generating the projector orbitals
Running ABACUS on the following input produces the jle.orb file:
INPUT_PARAMETERS
#Parameters (1.General)
suffix abacus
calculation gen_bessel # calculation type should be gen_bessel
nbands 277
symmetry 0
#Parameters (2.Iteration)
ecutwfc 100 # kinetic energy cutoff in unit Ry; should be consistent with that set for ABACUS SCF calculation
scf_thr 1e-8
scf_nmax 128
#Parameters (3.Basis)
basis_type pw
gamma_only 1
#Parameters (4.Smearing)
smearing_method gaussian
smearing_sigma 0.1
#Parameters (5.Mixing)
mixing_type pulay
mixing_beta 0.4
#Parameters (6. Bessel function)
bessel_descriptor_lmax 2 # maximum angular momentum for projectors; 2 is recommended
bessel_descriptor_rcut 5 # radial cutoff in unit Bohr; 5 or 6 is recommended
bessel_descriptor_tolerence 1.0e-12
The jle.orb file is generated in the directory /data/deepks_perov_demo/perov_vasp_single_label/jre_gen/OUT.abacus.
2.1.4 Training of model.ptg
After the initialization SCF calculations, a set of descriptors and energy labels is obtained. The program automatically reads this information, builds the training and test datasets, and performs the training, which produces the model.ptg file.
2.2 The iterative training loop
The formal iterations (iter.00, iter.01, iter.02, ...) build on the iter.init initialization loop by loading the model.ptg produced in the previous round into the SCF calculations.
Each iteration consists of: SCF calculations with model.ptg loaded, a check of MAE convergence, and either continued training or termination.
2.3 Model deployment
Deploying a DeePKS model is straightforward: simply reference the trained model.ptg in the ABACUS INPUT file.
Example ABACUS input file:
INPUT_PARAMETERS
calculation scf
ntype 3
ecutwfc 100.000000
scf_thr 1.000000e-07
scf_nmax 50
basis_type lcao
dft_functional pbe
gamma_only 0
mixing_type pulay
mixing_beta 0.400000
symmetry 0
nspin 1
smearing_method fixed
smearing_sigma 0.001000
kspacing 0.100000
cal_force 0
cal_stress 0
deepks_out_labels 1
deepks_scf 1
deepks_bandgap 1
deepks_model ../../../model.ptg # path to the model file
3. Input files
The configuration files live under iter in the project directory. A complete DeePKS workflow uses four configuration files, which define the training-system directories, the machine-learning hyperparameters, the ABACUS SCF parameters, and the machine settings for running DeePKS, respectively. They are described in detail below; see also the DeePKS-kit Documentation.
The high-accuracy training data live under systems in the project directory. The data used in this tutorial are an open dataset downloaded directly from AIS Square; the high-accuracy DFT labels were produced with VASP. For producing high-accuracy training data with ABACUS, see the tutorial DeePKS实战(零)|使用DeePKS init功能进行训练数据的生产.
3.1 Training dataset
The DeePKS training data consist of two parts: the structure files of the configurations (atom types, coordinates, and cell information) and the label data (e.g. energies, forces, stresses, and orbitals/band gaps). The structure-file format and the format of each label are summarized in the table below; a short shape-checking script follows the table.
File | Description | Shape | Unit
---|---|---|---
atom.npy | structure file, required | [nframes, natoms, 4] | Bohr, Angstrom, or fractional
box.npy | lattice vector file, optional | [nframes, 9] | Bohr or Angstrom
energy.npy | energy label, required | [nframes, 1] | Hartree
force.npy | force label, optional | [nframes, natoms, 3] | Hartree/Bohr
stress.npy | stress label, optional | [nframes, 9] | Hartree
orbital.npy | band-gap label, optional | [nframes, nkpt, 1] | Hartree
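A quick way to sanity-check a prepared system directory against the shapes in the table above is a short script like the following (the directory name ../systems/group.04 is just an example from this tutorial):

```python
import os
import numpy as np

def check_system(path):
    """Print shapes of the .npy files in one system directory and flag obvious mismatches."""
    atom = np.load(os.path.join(path, "atom.npy"))
    assert atom.ndim == 3 and atom.shape[2] == 4, "atom.npy must have shape [nframes, natoms, 4]"
    nframes, natoms = atom.shape[:2]
    print(f"{path}: {nframes} frames, {natoms} atoms")

    for name, shape in [("box.npy", (nframes, 9)),
                        ("energy.npy", (nframes, 1)),
                        ("force.npy", (nframes, natoms, 3))]:
        fpath = os.path.join(path, name)
        if os.path.exists(fpath):
            arr = np.load(fpath)
            status = "OK" if arr.shape == shape else f"unexpected shape {arr.shape}, expected {shape}"
            print(f"  {name}: {status}")

check_system("../systems/group.04")
```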
3.2 systems.yaml: system configuration
This file defines which systems make up the training and test sets. For example, the settings below place group.04 and group.05 in the training set and group.06 in the test set; a small script that expands these glob patterns follows the YAML block.
# this is only part of input settings.
# should be used together with params.yaml and machines.yaml
# training and testing systems
systems_train: # can also be files that containing system paths
- ../systems/group.04 # support glob
- ../systems/group.05 # support glob
systems_test: # if empty, use the last system of training set
- ../systems/group.06 # support glob
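Because the entries support glob patterns, it can be useful to confirm what they actually expand to before launching a run. A small sketch, assuming PyYAML is available and the file is named systems.yaml:

```python
import glob
import yaml

with open("systems.yaml") as f:
    cfg = yaml.safe_load(f)

# expand each training/testing pattern and report directories that match
for key in ("systems_train", "systems_test"):
    print(key)
    for pattern in cfg.get(key) or []:
        matches = sorted(glob.glob(pattern))
        print(f"  {pattern} -> {matches if matches else 'NO MATCH'}")
```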
3.3 params.yaml: machine-learning configuration
This file holds the training hyperparameters, such as batch_size and learning_rate. The parameter to pay particular attention to is n_iter, which sets the number of DeePKS iterations. A worked example of the learning-rate schedule defined by decay_rate and decay_steps follows the YAML block.
# this is only part of input settings.
# should be used together with systems.yaml and machines.yaml
# number of iterations to do, can be set to zero for DeePHF training
n_iter: 4
# directory setting (these are default choices, can be omitted)
workdir: "."
share_folder: "share" # folder that stores all other settings
# scf settings, set to false when n_iter = 0 to skip checking
scf_input: false
# train settings for iterations after iter.init, set to false when n_iter = 0 to skip checking
train_input:
# model_args is ignored, since this is used as restart
data_args:
batch_size: 16
group_batch: 1
extra_label: false #set to true if force/stress/energy gap labels are required
conv_filter: true
conv_name: conv
preprocess_args:
preshift: false # restarting model already shifted. Will not recompute shift value
prescale: false # same as above
prefit_ridge: 1e1
prefit_trainable: false
train_args:
decay_rate: 0.5
decay_steps: 1000
display_epoch: 100
force_factor: 0.0 #for force training; otherwise set to 0.
orbital_factor: 0.1 #set to non-zero value (e.g. 0.001) for orbital training.
n_epoch: 5000
start_lr: 0.0001
# init settings, these are for DeePHF task
init_model: false # do not use existing model to restart from
init_scf: true
init_train: # parameters for initial nn training
fit_elem: true
proj_basis: [[0, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]],
[1, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]],
[2, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]]
model_args:
hidden_sizes: [100, 100, 100] # neurons in hidden layers
output_scale: 100 # the output will be divided by 100 before compare with label
use_resnet: true # skip connection
actv_fn: mygelu # same as gelu, support force calculation
embedding: {embd_sizes: null, init_beta: 5, type: thermal}
data_args:
batch_size: 16
group_batch: 1 # can collect multiple system in one batch
preprocess_args:
preshift: true # shift the descriptor by its mean
prescale: false # scale the descriptor by its variance (can cause convergence problem)
prefit_ridge: 1e1 # do a ridge regression as prefitting
prefit_trainable: false
train_args:
decay_rate: 0.96 # learning rate decay factor
decay_steps: 500 # decay the learning rate every this steps
display_epoch: 100
n_epoch: 10000
start_lr: 0.001
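The decay_rate/decay_steps pair defines a stepwise exponential learning-rate schedule. A short sketch of the rule, assuming the decay is applied once every decay_steps epochs, which is consistent with the lr column of the log.train example in Section 4.5 (start_lr 1e-4, decay_rate 0.5, decay_steps 1000):

```python
def lr_at_epoch(epoch, start_lr=1e-4, decay_rate=0.5, decay_steps=1000):
    """Stepwise exponential decay: lr is multiplied by decay_rate every decay_steps epochs."""
    return start_lr * decay_rate ** (epoch // decay_steps)

for epoch in (0, 1000, 2000, 3000, 4000, 5000):
    print(epoch, lr_at_epoch(epoch))
# 0 1e-04, 1000 5e-05, 2000 2.5e-05, 3000 1.25e-05, 4000 6.25e-06, 5000 3.125e-06
```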
3.4 scf_abacus.yaml: ABACUS calculation configuration
scf_abacus:
#INPUT args
ntype: 3
ecutwfc: 100
scf_thr: 1e-7
scf_nmax: 50
dft_functional: "pbe"
gamma_only: 0
kspacing: 0.1
cal_force: 0
deepks_bandgap: 1
#STRU args ( Here are default STRU args, you can set for each group in ../systems/group.xx/stru_abacus.yaml )
orb_files: ["Cs_gga_10au_100Ry_4s2p1d.orb","Pb_gga_7au_100Ry_2s2p2d1f.orb","I_gga_7au_100Ry_2s2p2d1f.orb"]
pp_files: ["Cs_ONCV_PBE-1.0.upf","Pb_ONCV_PBE-1.0.upf","I_ONCV_PBE-1.0.upf"]
proj_file: ["jle.orb"]
lattice_constant: 1.88972613
lattice_vector: [[28, 0, 0], [0, 28, 0], [0, 0, 28]]
coord_type: "Cartesian"
#cmd args
run_cmd : "mpirun"
abacus_path: "abacus"
init_scf_abacus:
orb_files: ["Cs_gga_10au_100Ry_4s2p1d.orb","Pb_gga_7au_100Ry_2s2p2d1f.orb","I_gga_7au_100Ry_2s2p2d1f.orb"]
pp_files: ["Cs_ONCV_PBE-1.0.upf","Pb_ONCV_PBE-1.0.upf","I_ONCV_PBE-1.0.upf"]
proj_file: ["jle.orb"]
ntype: 3
ecutwfc: 100
scf_thr: 1e-7
scf_nmax: 50
dft_functional: "pbe"
gamma_only: 0
kspacing: 0.1
cal_force: 0
lattice_constant: 1.88972613
lattice_vector: [[28, 0, 0], [0, 28, 0], [0, 0, 28]]
coord_type: "Cartesian"
#cmd args
run_cmd : "mpirun"
abacus_path: "abacus"
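A brief note on one of the values above: lattice_constant: 1.88972613 is, to the precision given, the number of Bohr per Angstrom, so the lattice_vector entries are effectively supplied in Angstrom and scaled into ABACUS's internal Bohr units. A one-line check:

```python
BOHR_PER_ANGSTROM = 1.88972613      # the lattice_constant value used above

# the 28-Angstrom default box edge from lattice_vector, expressed in Bohr
print(28 * BOHR_PER_ANGSTROM)       # ~52.9 Bohr
```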
3.5 machines.yaml: configuration for running on the local machine
# this is only part of input settings.
# should be used together with systems.yaml and params.yaml
scf_machine:
group_size: 125
resources:
task_per_node: 1
sub_size: 4
dispatcher:
context: local
batch: shell # set to shell to run on local machine, you can also use `slurm`
train_machine:
dispatcher:
context: local
batch: shell # same as above, use shell to run on local machine
remote_profile: null # use lazy local
python: "python" # use python in path
# resources are no longer needed, and the task will use gpu automatically if there is one
# other settings (these are default, can be omitted)
cleanup: false # whether to delete slurm and err files
strict: true # do not allow undefined machine parameters
#paras for abacus
use_abacus: true # use abacus in scf calculation
3.6 machines_dpdispatcher.yaml: configuration for running on Bohrium compute nodes via DPDispatcher
scf_machine:
resources:
task_per_node: 8
dispatcher: dpdispatcher
dpdispatcher_resources:
number_node: 1
cpu_per_node: 16
group_size: 5
source_list: [/opt/intel/oneapi/setvars.sh]
sub_size: 1
dpdispatcher_machine:
context_type: lebesguecontext
batch_type: lebesgue
local_root: ./
remote_profile:
email:
password:
program_id:
input_data:
log_file: log.scf
err_file: err.scf
job_type: indicate
grouped: true
job_name: deepks-scf
disk_size: 100
scass_type: c32_m128_cpu # 机器配置
platform: ali
image_name: registry.dp.tech/dptech/prod-36094/deepks:v0.1.3 # 提交到bohrium镜像计算使用的镜像
on_demand: 0
train_machine:
dispatcher: dpdispatcher
dpdispatcher_machine:
context_type: lebesguecontext
batch_type: lebesgue
local_root: ./
remote_profile:
email:
password:
program_id:
input_data:
log_file: log.train
err_file: err.train
job_type: indicate
grouped: true
job_name: deepks-train
disk_size: 100
scass_type: c32_m128_cpu
platform: ali
image_name: registry.dp.tech/dptech/prod-36094/deepks:v0.1.3
on_demand: 0
dpdispatcher_resources:
number_node: 1
cpu_per_node: 8
group_size: 1
source_list: [~/.bashrc]
python: "python" # use python in path
# resources are no longer needed, and the task will use gpu automatically if there is one
# other settings (these are default; can be omitted)
cleanup: false # whether to delete slurm and err files
strict: true # do not allow undefined machine parameters
#paras for abacus
use_abacus: true # use abacus in scf calculation
4. Output files
The main DeePKS output files are log.iter and RECORD, which track the overall workflow; log.data, which records the SCF stage; log.train, which records the machine-learning stage; and the final model.ptg file. Each is described below.
4.1 log.iter: iteration log
Location: iter/log.iter
Purpose: reports the step the program is currently executing. The numbers after "step" identify the current step, using the same convention as the RECORD file described below.
# starting step: (0,) // normal execution is logged as "starting step"
# starting step: (0, 0)
# starting step: (0, 0, 0)
# starting step: (0, 0, 1)
2024-08-26 14:52:42,323 - INFO : info:check_all_finished: False
2024-08-26 14:52:42,323 - INFO : checking all job has been uploaded
2024-08-26 14:53:07,905 - INFO : job: 0f144ccdf7dd8dbd11e98656b96814199ba3b55f submit; job_id is 14356051:job_group_id:13233777
2024-08-26 14:53:08,208 - INFO : job: 851ac7d92ae55bbde4df41a3bb1b77c17394d18e submit; job_id is 14356052:job_group_id:13233777
2024-08-26 14:53:08,542 - INFO : job: dbeb4c25a5ede86622d661d6add29525e369b7ff submit; job_id is 14356053:job_group_id:13233777
2024-08-26 14:53:08,866 - INFO : job: ca553d1fe36a4bb4c436d89e125012bfa05008ff submit; job_id is 14356054:job_group_id:13233777
2024-08-26 14:53:09,222 - INFO : job: a4b0dd4c067164f06d5ccc6c14f6130c3e676440 submit; job_id is 14356055:job_group_id:13233777
2024-08-26 14:53:09,567 - INFO : job: 4f78ec82ae2fc3495d867991b2cead7ea630e55f submit; job_id is 14356056:job_group_id:13233777
2024-08-26 14:53:09,905 - INFO : job: fadd53b56edfbf078bef4d0e12058f3ce4d81c64 submit; job_id is 14356057:job_group_id:13233777
2024-08-26 14:53:10,230 - INFO : job: 0ac986537561dd13f9e2f37ebb6e41a31f5a980b submit; job_id is 14356058:job_group_id:13233777
2024-08-26 14:53:10,584 - INFO : job: 1373be20e3769de1cb0826d9afbcb55c5e6790e4 submit; job_id is 14356059:job_group_id:13233777
2024-08-26 14:53:10,945 - INFO : job: 56e7c4598c18e82411c6e488465f92d656709387 submit; job_id is 14356060:job_group_id:13233777
2024-08-26 14:53:11,332 - INFO : job: d1d503b9b3ed56ca97b34ecdf0dcd2293641e907 submit; job_id is 14356061:job_group_id:13233777
2024-08-26 14:53:11,627 - INFO : job: 1cd0b1ab35ac37a8a0d3d738b089287068c8ad62 submit; job_id is 14356062:job_group_id:13233777
2024-08-26 14:53:11,952 - INFO : job: 3fe2ccc772eeb26f7b102eab3a10873aa3ca6ed2 submit; job_id is 14356063:job_group_id:13233777
2024-08-26 14:53:12,288 - INFO : job: 6ccc8c9fbe67d504b47d8a02a88eee478fdf3c73 submit; job_id is 14356064:job_group_id:13233777
# starting step: (0, 0, 2)
# starting step: (0, 1)
# starting step: (0, 1, 0)
2024-08-26 15:09:10,808 - INFO : info:check_all_finished: False
2024-08-26 15:09:10,808 - INFO : checking all job has been uploaded
2024-08-26 15:09:12,503 - INFO : job: 347dce77f33f81e009bc189ab36e9bc161087ca3 submit; job_id is 14356504:job_group_id:13233826
# starting step: (0, 1, 1)
# starting step: (1,)
# starting step: (1, 0)
# starting step: (1, 0, 0)
# starting step: (1, 0, 1)
2024-08-26 15:26:55,690 - INFO : info:check_all_finished: False
2024-08-26 15:26:55,691 - INFO : checking all job has been uploaded
2024-08-26 15:27:21,128 - INFO : job: 0f144ccdf7dd8dbd11e98656b96814199ba3b55f submit; job_id is 14356666:job_group_id:13233843
2024-08-26 15:27:21,435 - INFO : job: 851ac7d92ae55bbde4df41a3bb1b77c17394d18e submit; job_id is 14356668:job_group_id:13233843
2024-08-26 15:27:21,752 - INFO : job: dbeb4c25a5ede86622d661d6add29525e369b7ff submit; job_id is 14356670:job_group_id:13233843
2024-08-26 15:27:22,069 - INFO : job: ca553d1fe36a4bb4c436d89e125012bfa05008ff submit; job_id is 14356672:job_group_id:13233843
2024-08-26 15:27:22,412 - INFO : job: a4b0dd4c067164f06d5ccc6c14f6130c3e676440 submit; job_id is 14356674:job_group_id:13233843
2024-08-26 15:27:22,699 - INFO : job: 4f78ec82ae2fc3495d867991b2cead7ea630e55f submit; job_id is 14356676:job_group_id:13233843
2024-08-26 15:27:23,082 - INFO : job: fadd53b56edfbf078bef4d0e12058f3ce4d81c64 submit; job_id is 14356678:job_group_id:13233843
2024-08-26 15:27:23,426 - INFO : job: 0ac986537561dd13f9e2f37ebb6e41a31f5a980b submit; job_id is 14356680:job_group_id:13233843
2024-08-26 15:27:23,777 - INFO : job: 1373be20e3769de1cb0826d9afbcb55c5e6790e4 submit; job_id is 14356682:job_group_id:13233843
2024-08-26 15:27:24,110 - INFO : job: 56e7c4598c18e82411c6e488465f92d656709387 submit; job_id is 14356685:job_group_id:13233843
2024-08-26 15:27:24,426 - INFO : job: d1d503b9b3ed56ca97b34ecdf0dcd2293641e907 submit; job_id is 14356686:job_group_id:13233843
2024-08-26 15:27:24,728 - INFO : job: 1cd0b1ab35ac37a8a0d3d738b089287068c8ad62 submit; job_id is 14356688:job_group_id:13233843
2024-08-26 15:27:25,069 - INFO : job: 3fe2ccc772eeb26f7b102eab3a10873aa3ca6ed2 submit; job_id is 14356690:job_group_id:13233843
2024-08-26 15:27:25,373 - INFO : job: 6ccc8c9fbe67d504b47d8a02a88eee478fdf3c73 submit; job_id is 14356692:job_group_id:13233843
# starting step: (1, 0, 2)
# restarting after step (1, 0, 1) // a restarted task is logged as "restarting after step"
# starting step: (1, 0, 2)
# starting step: (1, 1)
# starting step: (1, 1, 0)
2024-08-26 16:00:11,362 - INFO : info:check_all_finished: False
2024-08-26 16:00:11,363 - INFO : checking all job has been uploaded
2024-08-26 16:00:16,955 - INFO : job: a6aac67db774aa161a36fb8144717eb9970b469f submit; job_id is 14357062:job_group_id:13233891
# starting step: (1, 1, 1)
# starting step: (2,)
# starting step: (2, 0)
# starting step: (2, 0, 0)
# starting step: (2, 0, 1)
2024-08-26 16:24:06,143 - INFO : info:check_all_finished: False
2024-08-26 16:24:06,144 - INFO : checking all job has been uploaded
2024-08-26 16:24:33,572 - INFO : job: 0f144ccdf7dd8dbd11e98656b96814199ba3b55f submit; job_id is 14357460:job_group_id:13233955
2024-08-26 16:24:33,877 - INFO : job: 851ac7d92ae55bbde4df41a3bb1b77c17394d18e submit; job_id is 14357462:job_group_id:13233955
2024-08-26 16:24:34,224 - INFO : job: dbeb4c25a5ede86622d661d6add29525e369b7ff submit; job_id is 14357463:job_group_id:13233955
2024-08-26 16:24:34,532 - INFO : job: ca553d1fe36a4bb4c436d89e125012bfa05008ff submit; job_id is 14357464:job_group_id:13233955
2024-08-26 16:24:34,850 - INFO : job: a4b0dd4c067164f06d5ccc6c14f6130c3e676440 submit; job_id is 14357465:job_group_id:13233955
2024-08-26 16:24:35,152 - INFO : job: 4f78ec82ae2fc3495d867991b2cead7ea630e55f submit; job_id is 14357466:job_group_id:13233955
2024-08-26 16:24:35,466 - INFO : job: fadd53b56edfbf078bef4d0e12058f3ce4d81c64 submit; job_id is 14357467:job_group_id:13233955
2024-08-26 16:24:35,899 - INFO : job: 0ac986537561dd13f9e2f37ebb6e41a31f5a980b submit; job_id is 14357468:job_group_id:13233955
2024-08-26 16:24:36,230 - INFO : job: 1373be20e3769de1cb0826d9afbcb55c5e6790e4 submit; job_id is 14357469:job_group_id:13233955
2024-08-26 16:24:36,578 - INFO : job: 56e7c4598c18e82411c6e488465f92d656709387 submit; job_id is 14357470:job_group_id:13233955
2024-08-26 16:24:36,923 - INFO : job: d1d503b9b3ed56ca97b34ecdf0dcd2293641e907 submit; job_id is 14357472:job_group_id:13233955
2024-08-26 16:24:37,239 - INFO : job: 1cd0b1ab35ac37a8a0d3d738b089287068c8ad62 submit; job_id is 14357474:job_group_id:13233955
2024-08-26 16:24:37,596 - INFO : job: 3fe2ccc772eeb26f7b102eab3a10873aa3ca6ed2 submit; job_id is 14357475:job_group_id:13233955
2024-08-26 16:24:37,924 - INFO : job: 6ccc8c9fbe67d504b47d8a02a88eee478fdf3c73 submit; job_id is 14357476:job_group_id:13233955
# starting step: (2, 0, 2)
# starting step: (2, 1)
# starting step: (2, 1, 0)
2024-08-26 16:34:53,231 - INFO : info:check_all_finished: False
2024-08-26 16:34:53,231 - INFO : checking all job has been uploaded
2024-08-26 16:34:58,782 - INFO : job: a6aac67db774aa161a36fb8144717eb9970b469f submit; job_id is 14357639:job_group_id:13233983
# starting step: (2, 1, 1)
# starting step: (3,)
# starting step: (3, 0)
# starting step: (3, 0, 0)
# starting step: (3, 0, 1)
2024-08-26 17:05:22,966 - INFO : info:check_all_finished: False
2024-08-26 17:05:22,967 - INFO : checking all job has been uploaded
2024-08-26 17:06:08,550 - INFO : job: 0f144ccdf7dd8dbd11e98656b96814199ba3b55f submit; job_id is 14357935:job_group_id:13234204
2024-08-26 17:06:08,869 - INFO : job: 851ac7d92ae55bbde4df41a3bb1b77c17394d18e submit; job_id is 14357936:job_group_id:13234204
2024-08-26 17:06:09,184 - INFO : job: dbeb4c25a5ede86622d661d6add29525e369b7ff submit; job_id is 14357937:job_group_id:13234204
2024-08-26 17:06:09,479 - INFO : job: ca553d1fe36a4bb4c436d89e125012bfa05008ff submit; job_id is 14357938:job_group_id:13234204
2024-08-26 17:06:09,802 - INFO : job: a4b0dd4c067164f06d5ccc6c14f6130c3e676440 submit; job_id is 14357939:job_group_id:13234204
2024-08-26 17:06:10,111 - INFO : job: 4f78ec82ae2fc3495d867991b2cead7ea630e55f submit; job_id is 14357940:job_group_id:13234204
2024-08-26 17:06:10,420 - INFO : job: fadd53b56edfbf078bef4d0e12058f3ce4d81c64 submit; job_id is 14357941:job_group_id:13234204
2024-08-26 17:06:10,709 - INFO : job: 0ac986537561dd13f9e2f37ebb6e41a31f5a980b submit; job_id is 14357942:job_group_id:13234204
2024-08-26 17:06:11,037 - INFO : job: 1373be20e3769de1cb0826d9afbcb55c5e6790e4 submit; job_id is 14357943:job_group_id:13234204
2024-08-26 17:06:11,463 - INFO : job: 56e7c4598c18e82411c6e488465f92d656709387 submit; job_id is 14357944:job_group_id:13234204
2024-08-26 17:06:11,746 - INFO : job: d1d503b9b3ed56ca97b34ecdf0dcd2293641e907 submit; job_id is 14357945:job_group_id:13234204
2024-08-26 17:06:12,076 - INFO : job: 1cd0b1ab35ac37a8a0d3d738b089287068c8ad62 submit; job_id is 14357946:job_group_id:13234204
2024-08-26 17:06:12,390 - INFO : job: 3fe2ccc772eeb26f7b102eab3a10873aa3ca6ed2 submit; job_id is 14357947:job_group_id:13234204
2024-08-26 17:06:12,701 - INFO : job: 6ccc8c9fbe67d504b47d8a02a88eee478fdf3c73 submit; job_id is 14357948:job_group_id:13234204
# starting step: (3, 0, 2)
# starting step: (3, 1)
# starting step: (3, 1, 0)
2024-08-26 17:14:56,697 - INFO : info:check_all_finished: False
2024-08-26 17:14:56,698 - INFO : checking all job has been uploaded
2024-08-26 17:15:03,535 - INFO : job: a6aac67db774aa161a36fb8144717eb9970b469f submit; job_id is 14357970:job_group_id:13234219
# starting step: (3, 1, 1)
# starting step: (4,)
# starting step: (4, 0)
# starting step: (4, 0, 0)
# starting step: (4, 0, 1)
2024-08-26 17:47:56,516 - INFO : info:check_all_finished: False
2024-08-26 17:47:56,517 - INFO : checking all job has been uploaded
2024-08-26 17:48:23,971 - INFO : job: 0f144ccdf7dd8dbd11e98656b96814199ba3b55f submit; job_id is 14358353:job_group_id:13234331
2024-08-26 17:48:24,274 - INFO : job: 851ac7d92ae55bbde4df41a3bb1b77c17394d18e submit; job_id is 14358354:job_group_id:13234331
2024-08-26 17:48:24,576 - INFO : job: dbeb4c25a5ede86622d661d6add29525e369b7ff submit; job_id is 14358355:job_group_id:13234331
2024-08-26 17:48:24,893 - INFO : job: ca553d1fe36a4bb4c436d89e125012bfa05008ff submit; job_id is 14358356:job_group_id:13234331
2024-08-26 17:48:25,262 - INFO : job: a4b0dd4c067164f06d5ccc6c14f6130c3e676440 submit; job_id is 14358357:job_group_id:13234331
2024-08-26 17:48:25,544 - INFO : job: 4f78ec82ae2fc3495d867991b2cead7ea630e55f submit; job_id is 14358358:job_group_id:13234331
2024-08-26 17:48:25,904 - INFO : job: fadd53b56edfbf078bef4d0e12058f3ce4d81c64 submit; job_id is 14358359:job_group_id:13234331
2024-08-26 17:48:26,310 - INFO : job: 0ac986537561dd13f9e2f37ebb6e41a31f5a980b submit; job_id is 14358360:job_group_id:13234331
2024-08-26 17:48:26,681 - INFO : job: 1373be20e3769de1cb0826d9afbcb55c5e6790e4 submit; job_id is 14358361:job_group_id:13234331
2024-08-26 17:48:26,983 - INFO : job: 56e7c4598c18e82411c6e488465f92d656709387 submit; job_id is 14358362:job_group_id:13234331
2024-08-26 17:48:27,289 - INFO : job: d1d503b9b3ed56ca97b34ecdf0dcd2293641e907 submit; job_id is 14358363:job_group_id:13234331
2024-08-26 17:48:27,606 - INFO : job: 1cd0b1ab35ac37a8a0d3d738b089287068c8ad62 submit; job_id is 14358364:job_group_id:13234331
2024-08-26 17:48:27,956 - INFO : job: 3fe2ccc772eeb26f7b102eab3a10873aa3ca6ed2 submit; job_id is 14358365:job_group_id:13234331
2024-08-26 17:48:28,292 - INFO : job: 6ccc8c9fbe67d504b47d8a02a88eee478fdf3c73 submit; job_id is 14358366:job_group_id:13234331
# starting step: (4, 0, 2)
# starting step: (4, 1)
# starting step: (4, 1, 0)
2024-08-26 17:57:07,197 - INFO : info:check_all_finished: False
2024-08-26 17:57:07,198 - INFO : checking all job has been uploaded
2024-08-26 17:57:13,028 - INFO : job: a6aac67db774aa161a36fb8144717eb9970b469f submit; job_id is 14358367:job_group_id:13234332
# starting step: (4, 1, 1)
# restarting after step (4,)
4.2 RECORD: iteration record
Location: iter/RECORD
Purpose: records the iteration progress and controls where the program resumes from.
(X 0 0): at iteration X (X=0 corresponds to iter.init, X=1 to iter.00, X=2 to iter.01, and so on), preprocess all configurations for SCF and generate the ABACUS working directories and input files.
(X 0 1): run the SCF calculations in ABACUS.
(X 0 2): collect and check the SCF results; convergence and accuracy are printed to log.data in iter.xx/00.scf.
(X 0): the current SCF jobs are finished; ready for training.
(X 1 0): train a new model, using the old model (if any) as the starting point.
(X 1 1): the current training is finished; the learning curve is recorded in log.train in iter.xx/01.train.
(X 1): test the model on all data; the pure fitting error is shown in log.test in iter.xx/01.train.
(X): the current iteration is finished.
0 0 0 // iteration 0 (iter.init): preprocessing, generate ABACUS working directories and input files
0 0 1 // iteration 0: run the SCF calculations in ABACUS
0 0 2 // iteration 0: check the SCF results, print convergence and accuracy
0 0 // iteration 0: SCF jobs finished, ready to train the model
0 1 0 // iteration 0: model training, using the old model (if any) as the starting point
0 1 1 // iteration 0: training finished, learning curve recorded
0 1 // iteration 0: model tested, pure fitting error recorded
0 // iteration 0 finished
1 0 0 // iteration 1 (iter.00): preprocessing, generate ABACUS working directories and input files
1 0 1 // iteration 1: run the SCF calculations in ABACUS
1 0 2 // iteration 1: check the SCF results, print convergence and accuracy
1 0 // iteration 1: SCF jobs finished, ready to train the model
1 1 0 // iteration 1: model training, using the old model (if any) as the starting point
1 1 1 // iteration 1: training finished, learning curve recorded
1 1 // iteration 1: model tested, pure fitting error recorded
1 // iteration 1 finished
2 0 0 // iteration 2 (iter.01): preprocessing, generate ABACUS working directories and input files
2 0 1 // iteration 2: run the SCF calculations in ABACUS
2 0 2 // iteration 2: check the SCF results, print convergence and accuracy
2 0 // iteration 2: SCF jobs finished, ready to train the model
2 1 0 // iteration 2: model training, using the old model (if any) as the starting point
2 1 1 // iteration 2: training finished, learning curve recorded
2 1 // iteration 2: model tested, pure fitting error recorded
2 // iteration 2 finished
3 0 0 // iteration 3 (iter.02): preprocessing, generate ABACUS working directories and input files
3 0 1 // iteration 3: run the SCF calculations in ABACUS
3 0 2 // iteration 3: check the SCF results, print convergence and accuracy
3 0 // iteration 3: SCF jobs finished, ready to train the model
3 1 0 // iteration 3: model training, using the old model (if any) as the starting point
3 1 1 // iteration 3: training finished, learning curve recorded
3 1 // iteration 3: model tested, pure fitting error recorded
3 // iteration 3 finished
4 0 0 // iteration 4 (iter.03): preprocessing, generate ABACUS working directories and input files
4 0 1 // iteration 4: run the SCF calculations in ABACUS
4 0 2 // iteration 4: check the SCF results, print convergence and accuracy
4 0 // iteration 4: SCF jobs finished, ready to train the model
4 1 0 // iteration 4: model training, using the old model (if any) as the starting point
4 1 1 // iteration 4: training finished, learning curve recorded
4 1 // iteration 4: model tested, pure fitting error recorded
4 // iteration 4 finished
Note: the RECORD file acts like a checkpoint. If the program stops abnormally, resubmitting the job makes it read the last recorded position from RECORD and continue from there, and you can edit RECORD to restart the workflow from a chosen point, e.g. with the small sketch below.
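Based on the checkpoint behaviour described above, one simple (hypothetical) way to force a restart from a chosen step is to truncate RECORD so that the step and everything after it are forgotten:

```python
def truncate_record(path="iter/RECORD", restart_from=(1, 0, 1)):
    """Drop `restart_from` and every later entry from RECORD, so the workflow
    redoes that step (and everything after it) on the next submission."""
    target = tuple(restart_from)
    kept = []
    with open(path) as f:
        for line in f:
            parts = line.split("//")[0].split()
            if parts and tuple(int(x) for x in parts) == target:
                break
            kept.append(line)
    with open(path, "w") as f:
        f.writelines(kept)

# e.g. redo the ABACUS SCF runs of iter.00 (step "1 0 1") and everything after them
# truncate_record("iter/RECORD", (1, 0, 1))
```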
4.3 err.iter: error log
When the DeePKS workflow aborts, the exception log is written here; analyze it according to the specific error.
4.4 log.data: convergence and error log
Path: iter/iter.xx/00.scf/log.data
Purpose: records the SCF convergence of the training- and test-set configurations. In the init iteration no trained DeePKS model is loaded, so normally all configurations converge. In the early formal iterations some configurations may fail to converge because the model is not yet good enough; empirically, convergence improves as the number of iterations grows.
This file contains the error statistics and the SCF convergence rate for each iteration. An example is shown below:
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 1368.147964193488
MAE: 1368.147964193488
MARE: 243.226232035534
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 1520.1671384114968
MAE: 1520.1671384114968
MARE: 152.01917421800897
Here ME is the mean error, MAE the mean absolute error, and MARE the mean absolute relative error; MARE is computed after removing any constant energy offset between the target and baseline energies. Note that only energy errors appear here, because only the energy label was trained in the initial iteration.
In the example below, the force label is activated after the initial iteration by setting extra_label to true and force_factor to 1 in params.yaml, so its log.data also contains force error statistics:
Training:
Convergence:
899 / 900 = 0.99889
Energy:
ME: 1.707869318132222e-05
MAE: 3.188871711078968e-05
MARE: 3.054509587845316e-05
Force:
MAE: 0.00030976685248761896
Testing:
Convergence:
100 / 100 = 1.00000
Energy:
ME: 1.8457155353139854e-05
MAE: 3.5420404788446546e-05
MARE: 3.3798956665677724e-05
Force:
MAE: 0.0003271656570860149
To judge whether the DeePKS model has converged, compare the error statistics in log.data between the current and the previous iteration; if the errors stay essentially unchanged, the model can be considered converged. A small sketch for extracting these numbers across iterations follows.
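To make that comparison easier, the Testing-set energy MAE can be pulled out of each iteration's log.data programmatically. A small sketch, assuming the iter/iter.*/00.scf/log.data layout used in this tutorial:

```python
import glob
import re

def testing_mae(logfile):
    """Return the Testing-set energy MAE reported in one log.data file."""
    with open(logfile) as f:
        text = f.read()
    testing_block = text.split("Testing:")[-1]             # keep only the Testing section
    match = re.search(r"MAE:\s*([-+0-9.eE]+)", testing_block)
    return float(match.group(1)) if match else None

for logfile in sorted(glob.glob("iter/iter.*/00.scf/log.data")):
    print(logfile, testing_mae(logfile))
```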
4.5 log.train: training log
Location: iter/iter.xx/01.train/log.train
Purpose: records the training log.
This file contains the learning curve of each iteration's training. Note that for every iteration after the initial one, the training error (trn err) recorded here is the total error on the training set, i.e. the energy error plus the errors from any extra labels, while the test error (tst err) is the energy error on the test set only. For the initial training no extra labels are included, so both errors correspond to the energy error alone.
In a successful training run, both errors drop markedly during the first one or two iterations; as the iterations proceed, the decrease becomes progressively smaller. The columns can be plotted directly; see the short sketch after the log excerpt below.
# using seed: 305872219
# load 2 systems with fields {'lb_o', 'lb_e', 'op', 'eig'}
# load 1 systems with fields {'lb_o', 'lb_e', 'op', 'eig'}
# working on device: cpu
# epoch trn_err tst_err lr trn_time tst_time
0 1.60e-01 1.83e-01 1.00e-04 0.00 0.58
100 1.42e-02 2.15e-02 1.00e-04 0.09 0.01
200 1.02e-02 3.22e-03 1.00e-04 0.08 0.01
300 8.77e-03 3.16e-03 1.00e-04 0.09 0.01
400 9.96e-03 3.19e-03 1.00e-04 0.07 0.01
500 8.63e-03 3.14e-03 1.00e-04 0.09 0.01
600 8.47e-03 3.04e-03 1.00e-04 0.09 0.01
700 6.79e-03 3.03e-03 1.00e-04 0.23 0.01
800 9.64e-03 2.95e-03 1.00e-04 0.07 0.01
900 6.52e-03 2.87e-03 1.00e-04 0.07 0.01
1000 9.22e-03 2.70e-03 5.00e-05 0.07 0.01
1100 7.88e-03 2.68e-03 5.00e-05 0.10 0.01
1200 6.01e-03 2.65e-03 5.00e-05 0.07 0.01
1300 9.03e-03 2.57e-03 5.00e-05 0.07 0.01
1400 8.85e-03 2.44e-03 5.00e-05 0.07 0.01
1500 5.52e-03 2.49e-03 5.00e-05 0.97 0.01
1600 8.34e-03 2.31e-03 5.00e-05 0.07 0.01
1700 5.38e-03 2.27e-03 5.00e-05 0.08 0.01
1800 7.80e-03 2.22e-03 5.00e-05 0.07 0.01
1900 6.30e-03 2.25e-03 5.00e-05 0.09 0.01
2000 7.20e-03 2.37e-03 2.50e-05 0.07 0.02
2100 4.57e-03 2.68e-03 2.50e-05 0.08 0.01
2200 6.67e-03 2.55e-03 2.50e-05 0.07 0.01
2300 6.65e-03 2.64e-03 2.50e-05 0.07 0.01
2400 6.35e-03 2.66e-03 2.50e-05 0.17 0.01
2500 5.23e-03 2.85e-03 2.50e-05 1.94 0.59
2600 5.15e-03 2.85e-03 2.50e-05 0.09 0.01
2700 5.87e-03 2.89e-03 2.50e-05 0.08 0.01
2800 4.97e-03 3.20e-03 2.50e-05 0.08 0.36
2900 5.37e-03 3.06e-03 2.50e-05 0.07 0.01
3000 5.53e-03 3.18e-03 1.25e-05 0.07 0.01
3100 5.49e-03 3.36e-03 1.25e-05 0.07 0.01
3200 4.95e-03 3.20e-03 1.25e-05 0.07 0.01
3300 5.04e-03 3.23e-03 1.25e-05 0.08 0.01
3400 4.96e-03 3.25e-03 1.25e-05 0.09 0.01
3500 4.79e-03 3.26e-03 1.25e-05 0.08 0.01
3600 4.95e-03 3.29e-03 1.25e-05 0.08 0.01
3700 3.94e-03 3.31e-03 1.25e-05 0.07 0.01
3800 4.75e-03 3.33e-03 1.25e-05 0.07 0.01
3900 4.65e-03 3.33e-03 1.25e-05 0.07 0.01
4000 3.83e-03 3.37e-03 6.25e-06 0.08 0.01
4100 4.54e-03 3.37e-03 6.25e-06 0.08 0.01
4200 4.22e-03 3.36e-03 6.25e-06 0.07 0.01
4300 4.47e-03 3.37e-03 6.25e-06 0.07 0.01
4400 4.47e-03 3.37e-03 6.25e-06 0.08 0.01
4500 4.36e-03 3.37e-03 6.25e-06 0.91 0.01
4600 3.69e-03 3.37e-03 6.25e-06 0.08 0.01
4700 3.86e-03 3.38e-03 6.25e-06 0.07 0.01
4800 4.16e-03 3.41e-03 6.25e-06 0.09 0.01
4900 4.41e-03 3.39e-03 6.25e-06 0.08 0.01
5000 4.34e-03 3.38e-03 3.13e-06 0.07 0.01
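Since log.train is plain columnar text (lines beginning with # are comments), the learning curve can be plotted directly. A minimal sketch, assuming the iter.00 path used in this tutorial and that matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

# columns: epoch, trn_err, tst_err, lr, trn_time, tst_time
data = np.loadtxt("iter/iter.00/01.train/log.train", comments="#")
epoch, trn_err, tst_err = data[:, 0], data[:, 1], data[:, 2]

plt.semilogy(epoch, trn_err, label="train error")
plt.semilogy(epoch, tst_err, label="test error")
plt.xlabel("epoch")
plt.ylabel("error")
plt.legend()
plt.savefig("learning_curve.png", dpi=150)
```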
4.6 The model.pth and model.ptg files
model.pth — location: iter/iter.xx/01.train/model.pth. This is the model file produced directly by the neural network in DeePKS-kit; it cannot be loaded by ABACUS.
model.ptg — location: iter/iter.{xx+1}/00.scf/model.ptg. This is the reformatted model.pth and can be loaded by ABACUS.
The DeePKS workflow converts model.pth to model.ptg automatically.
If you need to convert model.pth to model.ptg manually, run the following script:
import torch
from deepks.model import CorrNet

# load the trained network from model.pth and save it in the
# compiled format (model.ptg) that ABACUS can load
mp = CorrNet.load("model.pth")
mp.compile_save("model.ptg")
5. Running a complete DeePKS calculation
DeePKS can be run locally or on the Bohrium platform. Because a large number of configurations must be computed, local runs can be slow, so the Bohrium platform is recommended.
In this notebook, users can run the cells directly to perform the DeePKS calculation.
To set up your own DeePKS project, the environment can be prepared in either of two ways:
- Create a copy of this notebook, which grants access to the deepks-abacus:3.7.5 image, and create a container node based on that image. (Recommended.)
- Install ABACUS manually (enable the DeePKS option when compiling) and install deepks-kit manually (the develop branch of the repository). (Not recommended; resolving version compatibility can take a long time.)
To run on the Bohrium platform, first fill in your account information in machines_dpdispatcher.yaml; a small sketch of how to do this with PyYAML follows.
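The notebook collects this information interactively. An equivalent offline sketch, assuming PyYAML is available (note that safe_dump does not preserve the comments in the file); replace the placeholder values with your own account, password, and program ID:

```python
import yaml

with open("machines_dpdispatcher.yaml") as f:
    cfg = yaml.safe_load(f)

credentials = {"email": "user@example.com",     # your Bohrium account
               "password": "your_password",
               "program_id": 12345}             # your Bohrium program / project ID

# fill the remote_profile of both the SCF and the training machine
for machine in ("scf_machine", "train_machine"):
    cfg[machine]["dpdispatcher_machine"]["remote_profile"].update(credentials)

with open("machines_dpdispatcher.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```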
Once the parameter files and the system data are ready, running DeePKS takes only one command.
run_dpdispatcher.sh submits the whole calculation to the Bohrium platform:
python -u -m deepks iterate machines_dpdispatcher.yaml params.yaml systems.yaml scf_abacus.yaml >> log.iter 2> err.iter
echo $! > PID
run.sh submits the calculation on the local machine:
python -u -m deepks iterate machines.yaml params.yaml systems.yaml scf_abacus.yaml >> log.iter 2> err.iter
After running the command above and letting several iterations complete, the output files are produced.
The log.data of each iteration in this example:
iter 00
Training:
Convergence:
26 / 30 = 0.86667
Energy:
ME: 1368.2168358234183
MAE: 1368.2168358234183
MARE: 243.23866784362883
Testing:
Convergence:
25 / 37 = 0.67568
Energy:
ME: 1520.2397938166193
MAE: 1520.2397938166193
MARE: 152.02295799320103
iter 01
Training:
Convergence:
29 / 30 = 0.96667
Energy:
ME: 1368.1824672322423
MAE: 1368.1824672322423
MARE: 243.23272421067358
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 1520.2023625619443
MAE: 1520.2023625619443
MARE: 152.01989532970182
iter 02
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 1368.1610629806628
MAE: 1368.1610629806628
MARE: 243.22833841125077
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 1520.1795806858859
MAE: 1520.1795806858859
MARE: 152.0185177052233
iter 03
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 1368.1504403544286
MAE: 1368.1504403544286
MARE: 243.22670281049608
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 1520.1685884598576
MAE: 1520.1685884598576
MARE: 152.0181481054287
As the number of iterations increases, the fraction of configurations that converge with the model loaded gradually improves. For demonstration purposes this tutorial sets the number of iterations to only 4; in production, more iterations may be needed to reach higher accuracy.
Extension: training with labels generated by ABACUS
The high-accuracy labels used above were computed with VASP (ABACUS did not yet support hybrid functionals when the paper was published). Because ABACUS and VASP energies cannot be aligned directly, an additional linear fit is needed to align them (fit_elem = true), which makes SCF convergence harder.
Thanks to the developers' continued efforts, ABACUS's hybrid-functional support has since matured. For this tutorial we regenerated the HSE label data with ABACUS, following the steps in the tutorial DeePKS实战(附录)|使用DeePKS init功能进行训练数据的生产, and retrained the model.
Note: since the training labels are now computed with ABACUS, the linear energy fit is no longer needed; set fit_elem in params.yaml to false and keep all other parameters the same.
The log.data of each iteration when training with the ABACUS-generated HSE energy labels:
iter 00
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 0.00368378714060024
MAE: 0.00368378714060024
MARE: 0.0006464547954338842
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 0.004351077716376961
MAE: 0.004351077716376961
MARE: 0.0008902025868237289
iter 01
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 0.011588828770394835
MAE: 0.011588828770394835
MARE: 0.002227502191587721
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 0.0132005898606474
MAE: 0.0132005898606474
MARE: 0.0019926605374009985
iter 02
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 0.0009461353226242863
MAE: 0.0010607801193524816
MARE: 0.0008397133021026094
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 0.0013915810218104278
MAE: 0.0023093618118690683
MARE: 0.002012919579480596
iter 03
Training:
Convergence:
30 / 30 = 1.00000
Energy:
ME: 0.0009533607899356866
MAE: 0.0009956163737721605
MARE: 0.0007812292491053086
Testing:
Convergence:
37 / 37 = 1.00000
Energy:
ME: 0.0016709236648313054
MAE: 0.0024677634297846896
MARE: 0.0021576741265390237
Comparing with the log.data of the VASP-label training, the model trained on ABACUS labels converges more easily in the next round of SCF, i.e. the model is more physically reasonable.
Appendix: DeePKS tutorial resources
Official documentation
- GitHub repository: https://github.com/deepmodeling/deepks-kit
- DeePKS-kit documentation: https://deepks-kit.readthedocs.io/en/latest/
References
- Chen Y, Zhang L, Wang H, et al. DeePKS: A comprehensive data-driven approach toward chemically accurate density functional theory[J]. Journal of Chemical Theory and Computation, 2021, 17(1): 170-181.
- Chen Y, Zhang L, Wang H, et al. DeePKS-kit: A package for developing machine learning-based chemically accurate energy and density functional models[J]. Computer Physics Communications, 2023, 282: 108520.
- Ou Q, Tuo P, Li W, et al. DeePKS Model for Halide Perovskites with the Accuracy of a Hybrid Functional[J]. The Journal of Physical Chemistry C, 2023, 127(37): 18755-18764.
- Li W, Ou Q, Chen Y, et al. DeePKS+ABACUS as a Bridge between Expensive Quantum Mechanical Models and Machine Learning Potentials[J]. The Journal of Physical Chemistry A, 2022, 126(49): 9154-9164.
- Xiao J, Chen Y X, Zhang L F, et al. A machine learning-based high-precision density functional method for drug-like molecules[J]. Artificial Intelligence Chemistry, 2024, 2(1): 100037.
- Liu X, Han Y, Li Z, et al. Dflow, a Python framework for constructing cloud-native AI-for-Science workflows[J]. arXiv preprint arXiv:2404.18392, 2024.
- Notebooks
- 从 DFT 先去 DeePKS 再到 DeePMD | DeePKS基础篇 https://bohrium.dp.tech/notebooks/8742877753
- 从 DFT 先去 DeePKS 再到 DeePMD | DeePKS案例篇 + 增强采样 https://bohrium.dp.tech/notebooks/7144731675
- DeePKS-Training-Demo https://bohrium.dp.tech/notebooks/8241862855
- DeePKS4Perovskites https://bohrium.dp.tech/notebooks/3541365519
- 漫谈AI时代的科学计算与物理建模|AI视角下的密度泛函建模与DeePKS https://bohrium.dp.tech/notebooks/7814778540
- Courses & videos
- ABACUS-DeePKS课程 https://bohrium.dp.tech/courses/9834756192
- 张林峰_DeePKS: a machine learning assisted electronic structure model 20210703 https://www.bilibili.com/video/BV1zU4y1J7xj
- DeePKS:AI辅助的电子结构方法 - 张林峰 | 钰沐菡 公益公开课 20220307 https://www.bilibili.com/video/BV1KY411V7DJ
- DeePKS+ABACUS:AI辅助的电子结构方法 20220803 https://www.bilibili.com/video/BV1bt4y137Xj
- 欧琪:ABACUS+DeePKS 20220829 https://www.bilibili.com/video/BV1mt4y1E7pn
- 李文菲:DeePKS原理 20221129 https://www.bilibili.com/video/BV1de411P7MZ
- 欧琪:DeePKS使用介绍&上机实践 20221220 https://www.bilibili.com/video/BV1ZK411z7Do
- 欧琪:DeePKS-ABACUS 20221129 https://www.bilibili.com/video/BV1WK411R7Ja
量子御坂