

Quick Start: Preparing DeePMD Datasets | CP2K Edition
©️ Copyright 2023 @ Authors
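Author: 宋哲轩 📨
Date: 2023-09-14
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the 开始连接 (Start Connection) button above, select the ubuntu:22.04-py3.10 image and any node configuration, and wait a moment for the notebook to become runnable.
This article uses dpdata to convert first-principles data generated by CP2K (single-point energy calculations, abbreviated here as fp, and ab initio molecular dynamics, aimd) into the npy format that DeePMD-kit can use directly.
Prerequisites: basic knowledge of CP2K
Software used: CP2K (2022.2), dpdata (0.2.16), cp2kdata (v0.4.3)
CP2K versions currently supported by dpdata/cp2kdata: fp_output (-2023), AIMD_output (-2023)
* 🎉 The latest dpdata/cp2kdata releases substantially extend the supported CP2K versions: CP2K 7.1, 8.x, 9.x, 2022, and others (see the GitHub home pages for details).
Special thanks to @njzjz @robinzyb @ChiahsinChu @link89 ... and to every developer for their contributions~💕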
1. CP2K input file requirements
Set PRINT_LEVEL to MEDIUM or higher.
For single-point calculations, set RUN_TYPE ENERGY_FORCE.
Enable the force output section:
fp: &FORCE_EVAL &PRINT &FORCES
aimd: &MOTION &PRINT &FORCES
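These requirements can be checked before a job is submitted. The sketch below is a naive line-based scan, not a real CP2K parser: it ignores preprocessor directives and assumes one keyword or section per line. The function name check_cp2k_input and the wording of its messages are hypothetical, introduced only for illustration.

```python
import re

# Section path that must exist for forces to be printed, per calculation type.
FORCES_SECTION = {
    "fp": ("FORCE_EVAL", "PRINT", "FORCES"),
    "aimd": ("MOTION", "PRINT", "FORCES"),
}


def check_cp2k_input(text, mode="fp"):
    """Return a list of problems found in a CP2K input; an empty list means none."""
    problems = []
    stack, seen = [], set()  # currently open sections, and every section path met
    run_type = None
    for raw in text.splitlines():
        # Strip comments ('#' or '!') and upper-case the line (CP2K input is case-insensitive).
        line = re.split(r"[#!]", raw, maxsplit=1)[0].strip().upper()
        if not line:
            continue
        tokens = line.split()
        head = tokens[0]
        if head == "&END":
            if stack:
                stack.pop()
        elif head.startswith("&"):
            stack.append(head[1:])
            seen.add(tuple(stack))
        elif stack == ["GLOBAL"] and len(tokens) > 1:
            if head == "PRINT_LEVEL" and tokens[1] in ("SILENT", "LOW"):
                problems.append("PRINT_LEVEL is below MEDIUM")
            elif head == "RUN_TYPE":
                run_type = tokens[1]
    if mode == "fp" and run_type != "ENERGY_FORCE":
        problems.append("RUN_TYPE is not ENERGY_FORCE")
    if FORCES_SECTION[mode] not in seen:
        problems.append("missing section &" + " / &".join(FORCES_SECTION[mode]))
    return problems


example = """
&GLOBAL
  PRINT_LEVEL MEDIUM
  RUN_TYPE ENERGY_FORCE
&END GLOBAL
&FORCE_EVAL
  &PRINT
    &FORCES ON
    &END FORCES
  &END PRINT
&END FORCE_EVAL
"""
print(check_cp2k_input(example, mode="fp"))  # prints []
```

Only an explicitly set SILENT or LOW level is flagged, so an input that omits PRINT_LEVEL passes this check; verify the default of your CP2K version separately.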
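The current folder contains the output of both the ab initio molecular dynamics (aimd) runs and the single-point (fp) calculations.
It also provides job_cp2k.json, which can be used to submit the CP2K jobs directly with lbg.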
./00_fp
├── 000
│   ├── STDOUTERR
│   ├── input.inp
│   ├── job_cp2k.json
│   ├── lbg-13387-8917799.sh
│   ├── lbg-13387-8917799.sh.bak
│   ├── lbg-13387-8918922.sh
│   ├── lbg-13387-8918922.sh.bak
│   └── output.log
└── 001
    ├── STDOUTERR
    ├── input.inp
    ├── job_cp2k.json
    ├── lbg-13387-8917816.sh
    ├── lbg-13387-8917816.sh.bak
    ├── lbg-13387-8918924.sh
    ├── lbg-13387-8918924.sh.bak
    └── output.log

2 directories, 16 files
./00_aimd
├── 000
│   ├── STDOUTERR
│   ├── aimd-1.ener
│   ├── aimd-1.restart
│   ├── aimd-1.restart.bak-1
│   ├── aimd-1.restart.bak-2
│   ├── aimd-frc-1.xyz
│   ├── aimd-pos-1.xyz
│   ├── input.inp
│   ├── job_cp2k.json
│   ├── lbg-13387-8917653.sh
│   ├── lbg-13387-8917653.sh.bak
│   ├── lbg-13387-8918919.sh
│   ├── lbg-13387-8918919.sh.bak
│   └── output.log
└── 001
    ├── STDOUTERR
    ├── aimd-1.ener
    ├── aimd-1.restart
    ├── aimd-1.restart.bak-1
    ├── aimd-1.restart.bak-2
    ├── aimd-frc-1.xyz
    ├── aimd-pos-1.xyz
    ├── input.inp
    ├── job_cp2k.json
    ├── lbg-13387-8917688.sh
    ├── lbg-13387-8917688.sh.bak
    ├── lbg-13387-8918920.sh
    ├── lbg-13387-8918920.sh.bak
    └── output.log

2 directories, 28 files
2. Processing fp output
2.1 A single structure
dpdata.LabeledSystem('file',fmt='cp2k/output')
Installing dpdata and cp2kdata with pip (output abridged): dpdata is upgraded from 0.2.14 to 0.2.16, and cp2kdata 0.4.3 is installed together with its dependencies (numpy 1.25.2, matplotlib 3.7.3, ase 3.22.1, regex 2023.8.8, click 8.1.7, and others). pip also warns that running as the 'root' user can break permissions and recommends using a virtual environment.
Data Summary
Labeled System
-------------------
Frame Numbers      : 1
Atom Numbers       : 1000
Including Virials  : No
Element List       :
-------------------
C  O  H
300  300  400
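The printed summary shows that data holds 1 frame with 1000 atoms and no virial information (whether you need it depends on your use case; virials are not required for training). The system contains three elements, C, O, and H, with 300, 300, and 400 atoms respectively.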
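The read step described above can be wrapped in a small helper. This is a minimal sketch: the helper name load_fp_output is hypothetical, and the example path is assumed from the folder layout of this notebook. The dpdata call itself is the one given at the start of this subsection.

```python
from pathlib import Path


def load_fp_output(log_path):
    """Read one CP2K single-point output file into a dpdata LabeledSystem."""
    log_path = Path(log_path)
    if not log_path.is_file():
        raise FileNotFoundError(f"CP2K output not found: {log_path}")
    import dpdata  # imported here so the path check runs even without dpdata installed
    return dpdata.LabeledSystem(str(log_path), fmt="cp2k/output")


# Usage, with a path assumed from the folder layout of this notebook:
# data = load_fp_output("./00_fp/000/output.log")
# print(data)                # the Data Summary block
# print(data.get_nframes())  # number of frames
# print(data.get_natoms())   # number of atoms
```

Failing early on a missing file gives a clearer message than letting the parser fail on a bad path.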
🎈 Tip: to make dpdata output the elements in a specific order, add the type_map keyword:
Data Summary
Labeled System
-------------------
Frame Numbers      : 1
Atom Numbers       : 1000
Including Virials  : No
Element List       :
-------------------
H  C  O
400  300  300
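With dpdata, the reordering is requested by passing type_map to the constructor, for example dpdata.LabeledSystem('file', fmt='cp2k/output', type_map=['H', 'C', 'O']). The sketch below only illustrates the effect on the element list seen in the two summaries; it is not dpdata's implementation, and the function name reorder_by_type_map is hypothetical.

```python
def reorder_by_type_map(names, counts, type_map):
    """Reorder element names and per-element atom counts to follow type_map."""
    missing = [n for n in names if n not in type_map]
    if missing:
        raise ValueError(f"elements not in type_map: {missing}")
    count_of = dict(zip(names, counts))
    # In this sketch, elements listed in type_map but absent from the system get 0.
    return list(type_map), [count_of.get(el, 0) for el in type_map]


names, counts = reorder_by_type_map(["C", "O", "H"], [300, 300, 400], ["H", "C", "O"])
print(names, counts)  # ['H', 'C', 'O'] [400, 300, 300]
```

Using one type_map for every dataset keeps the integer type indices consistent across systems that are later trained together.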
./01_dp/fp_single/
├── set.000
│   ├── box.npy
│   ├── coord.npy
│   ├── energy.npy
│   └── force.npy
├── type.raw
└── type_map.raw

1 directory, 6 files
At this point, the files in deepmd-npy format have been written to the specified location.
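A quick way to confirm that an export is complete is to compare the output folder against the layout listed above. The sketch below assumes the export was done with something like data.to_deepmd_npy('./01_dp/fp_single'); the helper name missing_deepmd_files is hypothetical, and the list of expected files is taken from the tree shown in this notebook rather than from a formal specification.

```python
from pathlib import Path

# File names taken from the directory listing of ./01_dp/fp_single/.
EXPECTED_TOP = ("type.raw", "type_map.raw")
EXPECTED_SET = ("box.npy", "coord.npy", "energy.npy", "force.npy")


def missing_deepmd_files(system_dir):
    """List the expected deepmd-npy files that are absent from system_dir."""
    system_dir = Path(system_dir)
    missing = [n for n in EXPECTED_TOP if not (system_dir / n).is_file()]
    set_dirs = sorted(system_dir.glob("set.*"))
    if not set_dirs:
        missing.append("set.*/")
    for set_dir in set_dirs:
        missing += [f"{set_dir.name}/{n}" for n in EXPECTED_SET
                    if not (set_dir / n).is_file()]
    return missing


# Usage (path assumed from this notebook):
# print(missing_deepmd_files("./01_dp/fp_single"))
```

An empty list means every expected file is present; it says nothing about whether the arrays inside are correct.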
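The files have the following meanings:
type_map.raw: the elements of the system; their positions are indexed 0, 1, 2, ...
type.raw: the element type of each atom, where 0, 1, 2 refer to the element at the corresponding position in type_map.raw
box/coord/energy/force.npy: the box, coordinates, energy, and forces of each frame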
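The relationship between type.raw and type_map.raw can be made concrete by decoding the integer types back into element symbols. This is a minimal sketch using only the standard library; the function name atom_symbols is hypothetical, and it assumes both files are whitespace-separated plain text, as dpdata writes them.

```python
from collections import Counter
from pathlib import Path


def atom_symbols(system_dir):
    """Return the element symbol of every atom in a deepmd-npy system folder."""
    system_dir = Path(system_dir)
    # type_map.raw: element at index 0, 1, 2, ...
    type_map = (system_dir / "type_map.raw").read_text().split()
    # type.raw: one integer per atom, indexing into type_map
    types = [int(t) for t in (system_dir / "type.raw").read_text().split()]
    return [type_map[t] for t in types]


# Usage (path assumed from this notebook):
# symbols = atom_symbols("./01_dp/fp_single")
# print(Counter(symbols))  # atoms per element
```

Counting the decoded symbols should reproduce the element counts printed in the Data Summary.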
1 [-183762.88766277]
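You now know the basics of dpdata and can read a single CP2K structure file and export it in deepmd-npy format. In practice, however, we often need to process many single-point output files in one go. How should that be done?
2.2 Files in multiple folders
dpdata.MultiSystems.from_dir('dir_path',file_name='file',fmt='cp2k/output')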
MultiSystems (1 systems containing 2 frames)
{'C300O300H400': Data Summary
Labeled System
-------------------
Frame Numbers      : 2
Atom Numbers       : 1000
Including Virials  : No
Element List       :
-------------------
C  O  H
300  300  400}
So a small change to the call is enough to read the output of several folders at once. Following the same steps as in 2.1, we write data to 01_dp/fp_all.
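The batch read and export can be combined in one helper. This is a sketch under stated assumptions: the name convert_fp_dirs is hypothetical, the default file name output.log and the example paths come from the folder layout of this notebook, and the export uses the to_deepmd_npy method of MultiSystems.

```python
from pathlib import Path


def convert_fp_dirs(root, out_dir, file_name="output.log"):
    """Read every CP2K single-point output below root and export deepmd-npy data."""
    root = Path(root)
    if not any(root.rglob(file_name)):
        raise FileNotFoundError(f"no {file_name} found below {root}")
    import dpdata  # imported here so the search above runs even without dpdata installed
    systems = dpdata.MultiSystems.from_dir(
        str(root), file_name=file_name, fmt="cp2k/output"
    )
    systems.to_deepmd_npy(str(out_dir))  # one sub-folder per chemical formula
    return systems


# Usage (paths assumed from this notebook):
# data = convert_fp_dirs("./00_fp", "./01_dp/fp_all")
# print(data)
```

MultiSystems groups frames by chemical formula, which is why the output folder contains one sub-folder per composition.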
./01_dp/fp_all/
├── C150O150H200
│   ├── set.000
│   │   ├── box.npy
│   │   ├── coord.npy
│   │   ├── energy.npy
│   │   └── force.npy
│   ├── type.raw
│   └── type_map.raw
└── C300O300H400
    ├── set.000
    │   ├── box.npy
    │   ├── coord.npy
    │   ├── energy.npy
    │   └── force.npy
    ├── type.raw
    └── type_map.raw

4 directories, 12 files
3. Processing AIMD output
3.1 A single trajectory
dpdata.LabeledSystem('dir_path',cp2k_output_name='output.log',fmt='cp2kdata/md')
 --- You are parsing data using package Cp2kData ---
You are reading cell information from ./00_aimd/000/output.log
Obtian Energies From ./00_aimd/000/aimd-1.ener
Obtian Structures From ./00_aimd/000/aimd-pos-1.xyz
Obtian Froces From ./00_aimd/000/aimd-frc-1.xyz
Atom names are fake chemical symbols as you set in cp2k input.
 --- You are parsing data using package Cp2kData ---
Data Summary
Labeled System
-------------------
Frame Numbers      : 51
Atom Numbers       : 500
Including Virials  : No
Element List       :
-------------------
C  O  H
150  150  200
./01_dp/aimd_single/
├── set.000
│   ├── box.npy
│   ├── coord.npy
│   ├── energy.npy
│   └── force.npy
├── type.raw
└── type_map.raw

1 directory, 6 files
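The parser log above shows which files a trajectory folder must contain: the CP2K output (cell information), the .ener file (energies), and the position and force .xyz files. The sketch below checks for them before calling dpdata. The helper names are hypothetical, and the project prefix aimd is an assumption taken from the file names in this notebook; adjust it to the project name in your own input.

```python
from pathlib import Path


def missing_md_files(run_dir, project="aimd", cp2k_output_name="output.log"):
    """List the files expected in a CP2K MD run folder that are absent."""
    run_dir = Path(run_dir)
    expected = [
        cp2k_output_name,        # cell information
        f"{project}-1.ener",     # energies
        f"{project}-pos-1.xyz",  # structures
        f"{project}-frc-1.xyz",  # forces
    ]
    return [name for name in expected if not (run_dir / name).is_file()]


def load_aimd_trajectory(run_dir, cp2k_output_name="output.log"):
    """Read one CP2K AIMD run folder into a dpdata LabeledSystem."""
    missing = missing_md_files(run_dir, cp2k_output_name=cp2k_output_name)
    if missing:
        raise FileNotFoundError(f"{run_dir} lacks: {', '.join(missing)}")
    import dpdata  # cp2kdata must be installed too; it provides the cp2kdata/md format
    return dpdata.LabeledSystem(
        str(run_dir), cp2k_output_name=cp2k_output_name, fmt="cp2kdata/md"
    )


# Usage (paths assumed from this notebook):
# data = load_aimd_trajectory("./00_aimd/000")
# data.to_deepmd_npy("./01_dp/aimd_single")
```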
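3.2 Multiple trajectories
For batch processing of trajectories, dpdata.MultiSystems does not currently support the cp2k/aimd_output format. We therefore write a script that combines MultiSystems() and LabeledSystem() to read and convert the data.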
data=dpdata.LabeledSystem('dir_path',cp2k_output_name='file',fmt='cp2kdata/md')
 --- You are parsing data using package Cp2kData ---
You are reading cell information from ./00_aimd/000/output.log
Obtian Energies From ./00_aimd/000/aimd-1.ener
Obtian Structures From ./00_aimd/000/aimd-pos-1.xyz
Obtian Froces From ./00_aimd/000/aimd-frc-1.xyz
Atom names are fake chemical symbols as you set in cp2k input.
 --- You are parsing data using package Cp2kData ---
 --- You are parsing data using package Cp2kData ---
You are reading cell information from ./00_aimd/001/output.log
Obtian Energies From ./00_aimd/001/aimd-1.ener
Obtian Structures From ./00_aimd/001/aimd-pos-1.xyz
Obtian Froces From ./00_aimd/001/aimd-frc-1.xyz
Atom names are fake chemical symbols as you set in cp2k input.
 --- You are parsing data using package Cp2kData ---
MultiSystems (1 systems containing 102 frames)
./01_dp/aimd_all/
└── C150O150H200
    ├── set.000
    │   ├── box.npy
    │   ├── coord.npy
    │   ├── energy.npy
    │   └── force.npy
    ├── type.raw
    └── type_map.raw

2 directories, 6 files
The output of data shows that the trajectories in both folders were read, 102 frames in total. To inspect the systems in detail, use data.systems:
{'C150O150H200': Data Summary
Labeled System
-------------------
Frame Numbers      : 102
Atom Numbers       : 500
Including Virials  : No
Element List       :
-------------------
C  O  H
150  150  200}
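The MultiSystems() + LabeledSystem() script described in this subsection might look like the sketch below. The helper names find_run_dirs and convert_aimd_dirs are hypothetical, and the sketch assumes each trajectory sits in a direct sub-folder of the root (000, 001, ...) containing output.log, as in this notebook.

```python
from pathlib import Path


def find_run_dirs(root, cp2k_output_name="output.log"):
    """Return the sorted sub-folders of root that contain a CP2K output file."""
    return sorted(p.parent for p in Path(root).glob(f"*/{cp2k_output_name}"))


def convert_aimd_dirs(root, out_dir, cp2k_output_name="output.log"):
    """Read every AIMD run folder below root and export deepmd-npy data."""
    run_dirs = find_run_dirs(root, cp2k_output_name)
    if not run_dirs:
        raise FileNotFoundError(f"no run folders with {cp2k_output_name} in {root}")
    import dpdata  # cp2kdata must be installed too; it provides the cp2kdata/md format
    systems = dpdata.MultiSystems()
    for run_dir in run_dirs:
        trajectory = dpdata.LabeledSystem(
            str(run_dir), cp2k_output_name=cp2k_output_name, fmt="cp2kdata/md"
        )
        systems.append(trajectory)  # frames with the same formula are merged
    systems.to_deepmd_npy(str(out_dir))
    return systems


# Usage (paths assumed from this notebook):
# data = convert_aimd_dirs("./00_aimd", "./01_dp/aimd_all")
# print(data)
# print(data.systems)
```

Because both trajectories share the composition C150O150H200, appending them yields a single system holding all frames.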
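😄 Congratulations! You have learned how to use dpdata to read and convert the various types of CP2K output files.
If you would like to try the full DP toolchain (train/finetune...), simply pack your 01_dp folder into a tgz archive and upload it for immediate use!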
01_dp/
01_dp/aimd_all/
01_dp/aimd_all/C150O150H200/
01_dp/aimd_all/C150O150H200/set.000/
01_dp/aimd_all/C150O150H200/set.000/box.npy
01_dp/aimd_all/C150O150H200/set.000/coord.npy
01_dp/aimd_all/C150O150H200/set.000/energy.npy
01_dp/aimd_all/C150O150H200/set.000/force.npy
01_dp/aimd_all/C150O150H200/type.raw
01_dp/aimd_all/C150O150H200/type_map.raw
01_dp/aimd_single/
01_dp/aimd_single/set.000/
01_dp/aimd_single/set.000/box.npy
01_dp/aimd_single/set.000/coord.npy
01_dp/aimd_single/set.000/energy.npy
01_dp/aimd_single/set.000/force.npy
01_dp/aimd_single/type.raw
01_dp/aimd_single/type_map.raw
01_dp/fp_all/
01_dp/fp_all/C150O150H200/
01_dp/fp_all/C150O150H200/set.000/
01_dp/fp_all/C150O150H200/set.000/box.npy
01_dp/fp_all/C150O150H200/set.000/coord.npy
01_dp/fp_all/C150O150H200/set.000/energy.npy
01_dp/fp_all/C150O150H200/set.000/force.npy
01_dp/fp_all/C150O150H200/type.raw
01_dp/fp_all/C150O150H200/type_map.raw
01_dp/fp_all/C300O300H400/
01_dp/fp_all/C300O300H400/set.000/
01_dp/fp_all/C300O300H400/set.000/box.npy
01_dp/fp_all/C300O300H400/set.000/coord.npy
01_dp/fp_all/C300O300H400/set.000/energy.npy
01_dp/fp_all/C300O300H400/set.000/force.npy
01_dp/fp_all/C300O300H400/type.raw
01_dp/fp_all/C300O300H400/type_map.raw
01_dp/fp_single/
01_dp/fp_single/set.000/
01_dp/fp_single/set.000/box.npy
01_dp/fp_single/set.000/coord.npy
01_dp/fp_single/set.000/energy.npy
01_dp/fp_single/set.000/force.npy
01_dp/fp_single/type.raw
01_dp/fp_single/type_map.raw
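The packing step can also be done from Python with the standard tarfile module, which is convenient inside a notebook. This is a sketch only: the function name pack_dataset and the archive name 01_dp.tgz are hypothetical, and the listing above may well have been produced with the tar command instead.

```python
import tarfile
from pathlib import Path


def pack_dataset(src_dir, archive_path):
    """Pack src_dir into a gzip-compressed tar archive and return the member names."""
    src_dir = Path(src_dir)
    if not src_dir.is_dir():
        raise NotADirectoryError(f"dataset folder not found: {src_dir}")
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the folder name, so paths start with "01_dp/..."
        tar.add(str(src_dir), arcname=src_dir.name)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Usage (names assumed from this notebook):
# for name in pack_dataset("./01_dp", "./01_dp.tgz"):
#     print(name)
```

Write the archive outside the folder being packed; otherwise the partially written archive would be added to itself.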







