Uni-Mol Property Prediction in Practice (Regression): Predicting Melting Points of Organic and Electrolyte Molecules






Updated: 2025-02-28
Recommended image: unimol-tools:0.0.1
Recommended machine: c3_m4_1 * NVIDIA T4
©️ Copyright 2025 @ Authors
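Authors:
曾博深
汪鸿帅
陈乐天
Date: 2025-02-28
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the 开始连接 (Start Connection) button above, select the unimol-tools:0.0.1 image and any GPU node configuration, and wait a moment until the notebook is ready to run.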
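Background
- The melting point is the temperature at which a substance changes from solid to liquid. At constant pressure, the temperature of a substance being heated typically stays constant during this transition and rises again only once all of the solid has melted.
- In batteries, the melting point of an electrolyte molecule is an important physical quantity for gauging its stability and usable temperature range. A good electrolyte material needs a wide liquid range, and different applications call for electrolytes with suitable melting points to meet specific performance requirements.
- Predicting the melting points of unknown molecules helps us screen the candidate chemical space for materials that can serve as electrolytes.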
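The screening idea in the last point can be sketched in a few lines. This is only an illustration: the helper name `screen_by_melting_point` and the 0 °C threshold are assumptions made for the example, and the three melting points are experimental values from the test set printed in Step2, not model predictions.

```python
# Hypothetical screening helper; the name and the threshold are illustrative assumptions.
def screen_by_melting_point(candidates, max_mp_celsius):
    """Return (smiles, mp) pairs at or below the threshold, lowest melting point first."""
    kept = [(smiles, mp) for smiles, mp in candidates.items() if mp <= max_mp_celsius]
    return sorted(kept, key=lambda pair: pair[1])


# Melting points in degrees Celsius, copied from the test-set output in Step2.
candidates = {
    "BrCBr": -52.625,
    "c1cncnc1": 21.333333,
    "c1nc[nH]n1": 120.0,
}

shortlist = screen_by_melting_point(candidates, max_mp_celsius=0.0)
print(shortlist)
```

In a real workflow the dictionary values would be the model's predictions, and the threshold would come from the target operating temperature of the cell.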
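Step0: Install Uni-Mol Tools
Use pip install to set up the environment and install unimol_tools.
If you use the default image for this case, no installation or environment setup is needed.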
[1]
import os
# Proxy settings for the hosted notebook environment; skip this cell when running elsewhere.
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'
[2]
# !pip install --upgrade pip
# !pip install torch joblib rdkit pyyaml addict tqdm matplotlib huggingface_hub seaborn
# !pip install numpy==1.22.4 pandas==1.4.0 scikit-learn==1.5.0
[3]
# !pip install unimol_tools
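Step1: Load the data
- The dataset contains SMILES strings and measured melting points (TARGET) for nearly 20,000 molecules.
- TARGET is a continuous value in degrees Celsius.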
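Before training, it can help to confirm that a file really has the two columns described above. The following is a minimal sketch using only the standard library; the function name `load_melting_points` is hypothetical, and the two sample rows are copied from the test set printed in Step2.

```python
import csv
import io


def load_melting_points(handle):
    """Read SMILES/TARGET rows, converting TARGET to float and rejecting bad rows."""
    reader = csv.DictReader(handle)
    missing = {"SMILES", "TARGET"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        smiles = row["SMILES"].strip()
        if not smiles:
            raise ValueError(f"empty SMILES on line {line_no}")
        rows.append((smiles, float(row["TARGET"])))  # float() raises on non-numeric targets
    return rows


# In-memory stand-in for a CSV file with the same header as the dataset.
sample = io.StringIO("SMILES,TARGET\nBrCBr,-52.625\nc1cncnc1,21.333333\n")
rows = load_melting_points(sample)
print(rows)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.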
[4]
!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv
--2025-02-28 16:44:34-- https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.7, 10.255.254.37, 10.255.254.18
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.7|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 69686 (68K) [text/csv]
Saving to: ‘./mp_test.csv.3’
mp_test.csv.3 100%[===================>] 68.05K --.-KB/s in 0.009s
2025-02-28 16:44:35 (7.24 MB/s) - ‘./mp_test.csv.3’ saved [69686/69686]
--2025-02-28 16:44:35-- https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 626850 (612K) [text/csv]
Saving to: ‘./mp_train.csv.3’
mp_train.csv.3 100%[===================>] 612.16K --.-KB/s in 0.05s
2025-02-28 16:44:36 (10.9 MB/s) - ‘./mp_train.csv.3’ saved [626850/626850]
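Step2: Sample the data
- The melting point dataset is fairly large, so a full run takes a while. For a quick demonstration you can randomly sample 10% of the training set and 10% of the test set by enabling the commented-out `sample` lines in the cell.
- As written, the cell uses all of the data. Training and testing on the full sets generally gives better predictions; adjust the sampling fraction to suit your time budget.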
[5]
import pandas as pd
train_data_total = pd.read_csv('./mp_train.csv')
# train_data = train_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for training
train_data = train_data_total  # use all data for training
print("------------ Sampled Train Data ------------")  # show the training data
print(train_data)
train_data.columns = ["SMILES", "TARGET"]
train_data.to_csv('./mp_train_1.csv', index=False)  # save the training set without the index column
print('\n')
test_data_total = pd.read_csv('./mp_test.csv')
# test_data = test_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for testing
test_data = test_data_total  # use all data for testing
print("------------ Sampled Test Data ------------")
print(test_data)
test_data.columns = ["SMILES", "TARGET"]
test_data.to_csv('./mp_test_1.csv', index=False)  # save the test set without the index column
------------ Sampled Train Data ------------
                                               SMILES  TARGET
0                            O=S(=O)(Cl)c1ccc(Cl)cc1F   36.00
1                     CN(C)c1cccc2c(S(N)(=O)=O)cccc12  219.75
2               CC(C)(C)C(O)C(Cc1ccc(Cl)cc1Cl)n1cncn1  148.00
3                            BrC(=NNc1ccccc1)c1ccccc1  190.00
4                            O=C(C(=O)c1cccs1)c1cccs1   83.00
...                                               ...     ...
17610  O=C(O)[C@@H]1[C@@H]2C=C[C@@H](C2)[C@H]1C(=O)O  185.00
17611                                   CCCCCC1=CCCC1  -83.00
17612          Clc1nccn1C(c1ccccc1)(c1ccccc1)c1ccccc1  201.00
17613                                O=C1COc2ccccc2N1  174.00
17614                        CCCCOC(=O)CCCCC(=O)OCCCC  -32.40
[17615 rows x 2 columns]
------------ Sampled Test Data ------------
                                      SMILES      TARGET
0     BrC(CCC(Br)C(Br)c1ccccc1)C(Br)c1ccccc1  194.000000
1                      BrC(c1ccccc1)c1ccccc1   41.666667
2                     BrC12CC3CC(CC(C3)C1)C2  119.000000
3                                      BrCBr  -52.625000
4                            BrCC(Br)(Br)CBr   10.750000
...                                      ...         ...
1952                           c1ccc2nonc2c1   54.000000
1953                           c1ccc2nsnc2c1   43.000000
1954                                c1cncnc1   21.333333
1955                       c1coc(C2CNCCN2)c1   87.000000
1956                              c1nc[nH]n1  120.000000
[1957 rows x 2 columns]
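The 10% sampling described in this step boils down to drawing a fixed-size random subset with a fixed seed, so the same rows are selected on every run. Below is a minimal standard-library sketch of that idea; `sample_fraction` is a hypothetical helper, and it will not select the same rows as pandas `DataFrame.sample` because the two use different random number generators.

```python
import random


def sample_fraction(rows, frac, seed=1):
    """Return a reproducible random subset with about frac * len(rows) items."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    k = max(1, round(frac * len(rows)))  # always keep at least one row
    return random.Random(seed).sample(rows, k)  # sampling without replacement


# Stand-in for the 17615 training rows: row indices only.
rows = list(range(17615))
subset = sample_fraction(rows, frac=0.1)
print(len(subset), subset[:5])
```

Fixing the seed matters for a tutorial like this one: it makes the reduced run comparable across sessions, so differences in the metrics come from the model rather than from a different draw of molecules.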
Step3: Visualize the dataset distribution
[6]
import matplotlib.pyplot as plt
bins = 30  # number of histogram bins
plt.figure(figsize=(6, 5))
# Semi-transparent bars keep both distributions visible where they overlap
plt.hist(train_data["TARGET"], bins=bins, alpha=0.7, label="Train Data")
plt.hist(test_data["TARGET"], bins=bins, alpha=0.7, label="Test Data")
plt.ylabel("Count")
plt.xlabel("Melting Point (℃)")
plt.title("Distribution")
plt.legend(prop={'size': 12})
plt.tick_params(labelsize=14)
plt.tight_layout()
plt.savefig('./dataset_distribution_histogram.png',
            format='png')
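When `plt.hist` is given an integer number of bins, each call derives its bin edges from the range of the data passed to it, so the train and test histograms are not guaranteed to share edges. If you want bar-for-bar comparable counts, one option is to compute the bins once over the joint range. The sketch below shows the idea in plain Python; `shared_bin_counts` is a hypothetical helper, and the ten values are the first five train and test targets printed in Step2.

```python
def shared_bin_counts(train, test, bins=30):
    """Count both sets in the same equal-width bins spanning their joint range."""
    lo = min(min(train), min(test))
    hi = max(max(train), max(test))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are identical

    def count(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
            counts[idx] += 1
        return counts

    edges = [lo + i * width for i in range(bins + 1)]
    return edges, count(train), count(test)


train_mp = [36.0, 219.75, 148.0, 190.0, 83.0]
test_mp = [194.0, 41.666667, 119.0, -52.625, 10.75]
edges, train_counts, test_counts = shared_bin_counts(train_mp, test_mp, bins=4)
print(edges)
print(train_counts, test_counts)
```

With matplotlib, the returned `edges` list can be passed as the `bins` argument of both `plt.hist` calls.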
Step4: Train the model
- Use Uni-Mol Tools to train a model on the data.
[8]
from unimol_tools import MolTrain,MolPredict
import numpy as np
clf = MolTrain(task='regression', # regression task
data_type='molecule', # data type: molecule
epochs=20, # number of epochs, i.e. full passes over the training set;
# in each epoch the model updates its parameters to reduce the prediction error.
learning_rate=0.0001,
batch_size=16,
early_stopping=5,
metrics='r2',
split='random',
# Path kept exactly as in the original notebook; adjust it to your own environment.
weight_path='/opt/mamba/lib/python3.10/site-packages/unimol_tools/weights//opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/model_4.pth',
save_path='./mp_train', # directory where the model is saved
)
clf.fit('./mp_train_1.csv') # training set file
clf = MolPredict(load_model='./mp_train') # load the trained model
2025-02-28 16:51:39 | unimol_tools/data/datareader.py | 193 | INFO | Uni-Mol Tools | Anomaly clean with 3 sigma threshold: 17615 -> 17585
2025-02-28 16:51:46 | unimol_tools/data/conformer.py | 126 | INFO | Uni-Mol Tools | Start generating conformers...
17585it [02:15, 130.00it/s]
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 135 | INFO | Uni-Mol Tools | Succeeded in generating conformers for 100.00% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 142 | INFO | Uni-Mol Tools | Succeeded in generating 3d conformers for 99.78% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 145 | INFO | Uni-Mol Tools | Failed 3d conformers indices: [41, 273, 437, 805, 892, 918, 1210, 2435, 2649, 3081, 3501, 4133, 5075, 5296, 5605, 5799, 6575, 6601, 6639, 7738, 7959, 8644, 9528, 10250, 10593, 10836, 11039, 11041, 12041, 12313, 12705, 13519, 14296, 14901, 15596, 15933, 16996, 17262]
2025-02-28 16:54:02 | unimol_tools/data/datahub.py | 112 | INFO | Uni-Mol Tools | Split method: random, fold: 5
2025-02-28 16:54:02 | unimol_tools/train.py | 202 | INFO | Uni-Mol Tools | Output directory already exists: ./mp_train
2025-02-28 16:54:02 | unimol_tools/train.py | 203 | INFO | Uni-Mol Tools | Warning: Overwrite output directory: ./mp_train
2025-02-28 16:54:02 | unimol_tools/models/unimol.py | 120 | INFO | Uni-Mol Tools | Loading pretrained weights from /opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/mol_pre_all_h_220816.pt
2025-02-28 16:54:03 | unimol_tools/models/nnmodel.py | 144 | INFO | Uni-Mol Tools | start training Uni-Mol:unimolv1
2025-02-28 16:54:55 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [1/20] train_loss: 0.4635, val_loss: 0.4275, val_r2: 0.5825, lr: 0.000098, 52.0s
2025-02-28 16:55:47 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [2/20] train_loss: 0.2808, val_loss: 0.2189, val_r2: 0.7861, lr: 0.000093, 51.9s
2025-02-28 16:57:32 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [4/20] train_loss: 0.1856, val_loss: 0.1807, val_r2: 0.8234, lr: 0.000082, 52.2s
2025-02-28 16:58:25 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [5/20] train_loss: 0.1532, val_loss: 0.1948, val_r2: 0.8096, lr: 0.000077, 52.5s
2025-02-28 17:01:04 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [8/20] train_loss: 0.0823, val_loss: 0.1748, val_r2: 0.8293, lr: 0.000062, 53.2s
2025-02-28 17:01:59 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [9/20] train_loss: 0.0684, val_loss: 0.1888, val_r2: 0.8155, lr: 0.000057, 55.4s
2025-02-28 17:02:54 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [10/20] train_loss: 0.0548, val_loss: 0.2044, val_r2: 0.8003, lr: 0.000052, 54.3s
2025-02-28 17:03:48 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [11/20] train_loss: 0.0453, val_loss: 0.1767, val_r2: 0.8274, lr: 0.000046, 53.9s
Train: 34%|███▎ | 296/879 [00:16<00:32, 18.07it/s, Epoch=Epoch 12/20, loss=0.0364, lr=0.0000]
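The first log line reports "Anomaly clean with 3 sigma threshold: 17615 -> 17585", meaning 30 molecules were dropped before training. The exact rule used inside Uni-Mol Tools is not shown in this notebook; the sketch below illustrates the generic three-sigma idea under the assumption that a target is kept when it lies within three standard deviations of the mean. The function name and the toy values are hypothetical.

```python
import statistics


def three_sigma_filter(values, n_sigma=3.0):
    """Keep values within n_sigma population standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= n_sigma * std]


# Toy targets (not from the dataset): twenty ordinary values and one extreme one.
targets = [20.0] * 20 + [1000.0]
cleaned = three_sigma_filter(targets)
print(len(targets), "->", len(cleaned))
```

Note that a single outlier inflates the standard deviation itself, so the rule only removes a point when the sample is large enough; with fewer than about eleven values no single point can be more than three population standard deviations from the mean.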
Step5: Predict melting points
[ ]
# Visualize the training result by comparing experimental and predicted values on the test set
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
clf = MolPredict(load_model='./mp_train') # load the trained model
predict = clf.predict('./mp_test_1.csv').reshape(-1)
test_set = pd.read_csv("./mp_test_1.csv",header='infer') # read the experimental data file
test_mp = test_set["TARGET"].to_numpy() # extract the experimental melting points
# Compute the joint range of predicted and experimental values to set the axis limits
xmin = min(predict.flatten().min(), test_mp.min())
xmax = max(predict.flatten().max(), test_mp.max())
ymin = xmin
ymax = xmax
[ ]
# Set the figure size
plt.figure(figsize=(7, 6))
# Set the x-axis and y-axis ranges
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
# Add x-axis and y-axis labels
plt.xlabel('Predicted Melting Point (℃)', fontsize=14)
plt.ylabel('Experimental Melting Point (℃)', fontsize=14)
# Add a title
plt.title('Experimental vs Predicted Melting Point', fontsize=16)
# Draw the scatter plot
plt.scatter(predict, test_mp, color='blue', alpha=0.6)
# Draw the y = x reference line
x = np.linspace(xmin, xmax)
plt.plot(x, x, color='red', linestyle='--', linewidth=2)
# Show the figure
plt.show()
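The parity plot gives a visual impression; a few scalar metrics make it easier to compare runs. The following is a minimal sketch of R² (the metric used during training), MAE, and RMSE in plain Python. `regression_metrics` is a hypothetical helper; the experimental values are four test-set targets from Step2, while the predicted values are made up for illustration and are not model output.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired experimental and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all experimental values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


experimental = [194.0, 119.0, -52.625, 10.75]  # test-set targets from Step2, in ℃
predicted = [180.0, 125.0, -40.0, 20.0]  # illustrative values, NOT model predictions
metrics = regression_metrics(experimental, predicted)
print(metrics)
```

With the notebook's variables, the call would be `regression_metrics(test_mp.tolist(), predict.tolist())`.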
Exercise
Try another run with the full training and test sets and compare the prediction quality of the two runs.