Uni-Mol Property Prediction in Practice - Regression Task - Melting Point Prediction for Organic/Electrolyte Molecules
Letian Chen (陈乐天)
zengboshen@dp.tech
wanghongshuai@dp.tech
Updated: 2025-02-28
Recommended image: unimol-tools:0.0.1
Recommended machine: c3_m4_1 * NVIDIA T4

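©️ Copyright 2025 @ Authors
Authors: Boshen Zeng (曾博深), Hongshuai Wang (汪鸿帅), Letian Chen (陈乐天)
Date: 2025-02-28
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Start Connection (开始连接) button above, select the unimol-tools:0.0.1 image and any GPU node configuration, and the notebook will be ready to run after a short wait.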


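Background

  • The melting point is the temperature at which a substance changes from solid to liquid. At constant pressure, the temperature of a substance does not rise while it is melting; only after all of the solid has turned into liquid does the temperature start to rise again.
  • In batteries, the melting point of an electrolyte molecule is an important physical quantity for judging its stability and usable temperature range. A good electrolyte material needs a wide liquid range, and different applications call for electrolytes with suitable melting points to meet their specific performance requirements.
  • Predicting the melting points of unknown molecules helps us screen the accessible chemical space, in an inverse-design fashion, for materials that can serve as electrolytes.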
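To make the screening idea concrete, here is a minimal sketch that filters a handful of molecules by a melting-point cutoff. The SMILES strings and values are copied from the test-set preview printed in Step2 of this notebook; the cutoff `MAX_MP_C` and the helper `screen_by_melting_point` are illustrative assumptions and are not part of Uni-Mol Tools.

```python
# Illustrative sketch: keep only molecules whose melting point is below a cutoff.
# MAX_MP_C is a hypothetical threshold chosen for the example, not a recommendation.
MAX_MP_C = 25.0

# (SMILES, melting point in degrees Celsius), copied from the test-set preview.
candidates = [
    ("BrCBr", -52.625),
    ("BrCC(Br)(Br)CBr", 10.75),
    ("c1cncnc1", 21.333333),
    ("BrC(c1ccccc1)c1ccccc1", 41.666667),
    ("c1nc[nH]n1", 120.0),
]


def screen_by_melting_point(molecules, max_mp_c):
    """Return the molecules that melt below max_mp_c, lowest melting point first."""
    kept = [(smiles, mp) for smiles, mp in molecules if mp < max_mp_c]
    return sorted(kept, key=lambda item: item[1])


liquid_at_cutoff = screen_by_melting_point(candidates, MAX_MP_C)
for smiles, mp in liquid_at_cutoff:
    print(f"{smiles}: {mp:.1f} ℃")
```

In a real screening run the measured values would be replaced by model predictions for molecules that have no experimental data.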

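Step0: Install Uni-Mol Tools

Use pip install to set up the environment and install unimol_tools.

If you use the default image for this notebook, no installation or environment setup is needed.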
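Before deciding whether to uncomment the install cells, it can help to check which packages are already present. The sketch below uses only the standard library; the helper name `is_installed` is an assumption introduced for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the main packages this notebook relies on.
for name in ("unimol_tools", "rdkit", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any package is reported as missing, uncomment the corresponding pip install line in the cells that follow.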

[1]
import os
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'
[2]
# !pip install --upgrade pip
# !pip install torch joblib rdkit pyyaml addict tqdm matplotlib huggingface_hub seaborn
# !pip install numpy==1.22.4 pandas==1.4.0 scikit-learn==1.5.0
[3]
# !pip install unimol_tools

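Step1: Load the Data

  • The dataset contains SMILES strings and measured melting points (TARGET) for nearly 20,000 molecules.
  • TARGET is a continuous value in degrees Celsius.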
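A quick sanity check of the SMILES/TARGET layout can be done with the standard library alone. The sketch below parses a few rows copied from the training-set preview in Step2, held in an in-memory string that stands in for the downloaded file; the helper `summarize_targets` is an assumption made for this example.

```python
import csv
import io
import statistics

# A few rows copied from the training-set preview, standing in for mp_train.csv.
sample_csv = io.StringIO(
    "SMILES,TARGET\n"
    "O=S(=O)(Cl)c1ccc(Cl)cc1F,36.00\n"
    "CN(C)c1cccc2c(S(N)(=O)=O)cccc12,219.75\n"
    "CCCCCC1=CCCC1,-83.00\n"
    "CCCCOC(=O)CCCCC(=O)OCCCC,-32.40\n"
)


def summarize_targets(handle):
    """Read SMILES/TARGET rows and return (row count, min, max, mean) of TARGET."""
    rows = list(csv.DictReader(handle))
    targets = [float(row["TARGET"]) for row in rows]
    return len(rows), min(targets), max(targets), statistics.mean(targets)


count, low, high, mean = summarize_targets(sample_csv)
print(f"{count} rows, TARGET from {low} to {high} ℃, mean {mean:.2f} ℃")
```

To run the same check on the real data, pass `open("./mp_train.csv", newline="")` instead of the in-memory sample.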
[4]
!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv
--2025-02-28 16:44:34--  https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.7, 10.255.254.37, 10.255.254.18
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.7|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 69686 (68K) [text/csv]
Saving to: ‘./mp_test.csv.3’

mp_test.csv.3       100%[===================>]  68.05K  --.-KB/s    in 0.009s  

2025-02-28 16:44:35 (7.24 MB/s) - ‘./mp_test.csv.3’ saved [69686/69686]

--2025-02-28 16:44:35--  https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 626850 (612K) [text/csv]
Saving to: ‘./mp_train.csv.3’

mp_train.csv.3      100%[===================>] 612.16K  --.-KB/s    in 0.05s   

2025-02-28 16:44:36 (10.9 MB/s) - ‘./mp_train.csv.3’ saved [626850/626850]


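Step2: Sample the Data

  • The melting-point dataset is fairly large, so training on all of it takes too long for a quick demonstration. To shorten the run, randomly sample 10% of the training and test sets by enabling the commented-out sample lines in the cell below with frac=0.1.
  • The cell as written uses the full training and test sets, which gives better predictions. Adjust the sampling fraction to suit your time budget.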
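As a small illustration of how the sampling fraction behaves, the sketch below applies pandas `DataFrame.sample` to a synthetic table. The 100 placeholder rows are made up for this example and only stand in for the real dataset.

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 100 placeholder rows, not real molecules.
demo = pd.DataFrame({
    "SMILES": [f"placeholder_{i}" for i in range(100)],
    "TARGET": [float(i) for i in range(100)],
})

# frac=0.1 keeps 10% of the rows; random_state makes the draw reproducible.
subset_a = demo.sample(frac=0.1, random_state=1)
subset_b = demo.sample(frac=0.1, random_state=1)

print(len(subset_a))              # number of sampled rows
print(subset_a.equals(subset_b))  # the same seed selects the same rows
```

Fixing `random_state` matters when comparing runs, because it guarantees that two experiments see the same subset.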
[5]
import pandas as pd

train_data_total = pd.read_csv('./mp_train.csv')
# train_data = train_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for training
train_data = train_data_total  # train on all of the data
print("------------ Sampled Train Data ------------")  # show the training data
print(train_data)
train_data.columns = ["SMILES", "TARGET"]
train_data.to_csv('./mp_train_1.csv', index=False)  # save the dataset used for training
print('\n')

test_data_total = pd.read_csv('./mp_test.csv')
# test_data = test_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for testing
test_data = test_data_total  # test on all of the data
print("------------ Sampled Test Data ------------")
print(test_data)
test_data.columns = ["SMILES", "TARGET"]
test_data.to_csv('./mp_test_1.csv', index=False)  # save the dataset used for testing
------------ Sampled Train Data ------------
                                              SMILES  TARGET
0                           O=S(=O)(Cl)c1ccc(Cl)cc1F   36.00
1                    CN(C)c1cccc2c(S(N)(=O)=O)cccc12  219.75
2              CC(C)(C)C(O)C(Cc1ccc(Cl)cc1Cl)n1cncn1  148.00
3                           BrC(=NNc1ccccc1)c1ccccc1  190.00
4                           O=C(C(=O)c1cccs1)c1cccs1   83.00
...                                              ...     ...
17610  O=C(O)[C@@H]1[C@@H]2C=C[C@@H](C2)[C@H]1C(=O)O  185.00
17611                                  CCCCCC1=CCCC1  -83.00
17612         Clc1nccn1C(c1ccccc1)(c1ccccc1)c1ccccc1  201.00
17613                               O=C1COc2ccccc2N1  174.00
17614                       CCCCOC(=O)CCCCC(=O)OCCCC  -32.40

[17615 rows x 2 columns]


------------ Sampled Test Data ------------
                                      SMILES      TARGET
0     BrC(CCC(Br)C(Br)c1ccccc1)C(Br)c1ccccc1  194.000000
1                      BrC(c1ccccc1)c1ccccc1   41.666667
2                     BrC12CC3CC(CC(C3)C1)C2  119.000000
3                                      BrCBr  -52.625000
4                            BrCC(Br)(Br)CBr   10.750000
...                                      ...         ...
1952                           c1ccc2nonc2c1   54.000000
1953                           c1ccc2nsnc2c1   43.000000
1954                                c1cncnc1   21.333333
1955                       c1coc(C2CNCCN2)c1   87.000000
1956                              c1nc[nH]n1  120.000000

[1957 rows x 2 columns]

Step3: Visualize the Dataset Distribution

[6]
import matplotlib.pyplot as plt

bins = 30  # number of histogram bins shared by both datasets
plt.figure(figsize=(6, 5))
# alpha < 1 keeps both histograms visible where they overlap
plt.hist(train_data["TARGET"], bins=bins, alpha=0.6, label="Train Data")
plt.hist(test_data["TARGET"], bins=bins, alpha=0.6, label="Test Data")

plt.ylabel("Count")
plt.xlabel("Melting Point (℃)")
plt.title("Distribution")
plt.legend(prop={'size': 12})
plt.tick_params(labelsize=14)
plt.tight_layout()

plt.savefig('./dataset_distribution_histogram.png', format='png')

Step4: Train the Model

  • Use Uni-Mol Tools to train a model on the data.
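The training cell in this step sets early_stopping=5 and metrics='r2'. How Uni-Mol Tools applies these settings internally is not shown in this notebook, so the following is only a generic sketch of a patience-based stopping rule for a higher-is-better metric. The function name `early_stopping_epoch` and the validation scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Generic patience rule for a higher-is-better metric such as R^2:
    stop once `patience` consecutive epochs fail to improve on the best score so far.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation R^2 values, made up for illustration only.
scores = [0.58, 0.79, 0.82, 0.81, 0.80, 0.81, 0.80, 0.79, 0.78]
print(early_stopping_epoch(scores, patience=5))
```

With a rule like this, the epochs setting acts as an upper bound rather than a fixed training length.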
[8]
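from unimol_tools import MolTrain, MolPredict
import numpy as np

clf = MolTrain(
    task='regression',       # regression task
    data_type='molecule',    # input data type: molecules
    epochs=20,               # number of passes over the whole training set;
                             # in each epoch the parameters are updated to reduce the prediction error
    learning_rate=0.0001,
    batch_size=16,
    early_stopping=5,
    metrics='r2',
    split='random',
    # Path kept exactly as written in the original notebook. The training log reports
    # that the pretrained weights mol_pre_all_h_220816.pt were loaded.
    weight_path='/opt/mamba/lib/python3.10/site-packages/unimol_tools/weights//opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/model_4.pth',
    save_path='./mp_train',  # directory where the model is saved
)
clf.fit('./mp_train_1.csv')                # training-set file
clf = MolPredict(load_model='./mp_train')  # load the trained model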
2025-02-28 16:51:39 | unimol_tools/data/datareader.py | 193 | INFO | Uni-Mol Tools | Anomaly clean with 3 sigma threshold: 17615 -> 17585
2025-02-28 16:51:46 | unimol_tools/data/conformer.py | 126 | INFO | Uni-Mol Tools | Start generating conformers...
17585it [02:15, 130.00it/s]
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 135 | INFO | Uni-Mol Tools | Succeeded in generating conformers for 100.00% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 142 | INFO | Uni-Mol Tools | Succeeded in generating 3d conformers for 99.78% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 145 | INFO | Uni-Mol Tools | Failed 3d conformers indices: [41, 273, 437, 805, 892, 918, 1210, 2435, 2649, 3081, 3501, 4133, 5075, 5296, 5605, 5799, 6575, 6601, 6639, 7738, 7959, 8644, 9528, 10250, 10593, 10836, 11039, 11041, 12041, 12313, 12705, 13519, 14296, 14901, 15596, 15933, 16996, 17262]
2025-02-28 16:54:02 | unimol_tools/data/datahub.py | 112 | INFO | Uni-Mol Tools | Split method: random, fold: 5
2025-02-28 16:54:02 | unimol_tools/train.py | 202 | INFO | Uni-Mol Tools | Output directory already exists: ./mp_train
2025-02-28 16:54:02 | unimol_tools/train.py | 203 | INFO | Uni-Mol Tools | Warning: Overwrite output directory: ./mp_train
2025-02-28 16:54:02 | unimol_tools/models/unimol.py | 120 | INFO | Uni-Mol Tools | Loading pretrained weights from /opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/mol_pre_all_h_220816.pt
2025-02-28 16:54:03 | unimol_tools/models/nnmodel.py | 144 | INFO | Uni-Mol Tools | start training Uni-Mol:unimolv1
2025-02-28 16:54:55 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [1/20] train_loss: 0.4635, val_loss: 0.4275, val_r2: 0.5825, lr: 0.000098, 52.0s
2025-02-28 16:55:47 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [2/20] train_loss: 0.2808, val_loss: 0.2189, val_r2: 0.7861, lr: 0.000093, 51.9s
2025-02-28 16:57:32 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [4/20] train_loss: 0.1856, val_loss: 0.1807, val_r2: 0.8234, lr: 0.000082, 52.2s
2025-02-28 16:58:25 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [5/20] train_loss: 0.1532, val_loss: 0.1948, val_r2: 0.8096, lr: 0.000077, 52.5s
2025-02-28 17:01:04 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [8/20] train_loss: 0.0823, val_loss: 0.1748, val_r2: 0.8293, lr: 0.000062, 53.2s
2025-02-28 17:01:59 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [9/20] train_loss: 0.0684, val_loss: 0.1888, val_r2: 0.8155, lr: 0.000057, 55.4s
2025-02-28 17:02:54 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [10/20] train_loss: 0.0548, val_loss: 0.2044, val_r2: 0.8003, lr: 0.000052, 54.3s
2025-02-28 17:03:48 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [11/20] train_loss: 0.0453, val_loss: 0.1767, val_r2: 0.8274, lr: 0.000046, 53.9s
Train:  34%|███▎      | 296/879 [00:16<00:32, 18.07it/s, Epoch=Epoch 12/20, loss=0.0364, lr=0.0000]
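The first line of the log above reports "Anomaly clean with 3 sigma threshold: 17615 -> 17585", meaning 30 training rows were dropped before conformer generation. The exact rule used inside Uni-Mol Tools is not shown in this notebook; the sketch below illustrates the common reading of a 3-sigma filter (keep values within three standard deviations of the mean) on synthetic numbers. The function name `three_sigma_mask` is an assumption.

```python
import numpy as np


def three_sigma_mask(values, n_sigma=3.0):
    """Boolean mask that is True for values within n_sigma standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return np.abs(values - mean) <= n_sigma * std


# Synthetic targets: 50 ordinary values plus one extreme outlier.
targets = np.concatenate([np.linspace(-50.0, 250.0, 50), [5000.0]])
mask = three_sigma_mask(targets)
print(f"{targets.size} -> {int(mask.sum())}")
```

Filtering of this kind removes extreme labels that would otherwise dominate a squared-error loss.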

Step5: Predict Melting Points

[ ]
# Visualize the trained model by comparing experimental and predicted values on the test set
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from unimol_tools import MolPredict

clf = MolPredict(load_model='./mp_train')  # load the trained model
predict = clf.predict('./mp_test_1.csv').reshape(-1)

test_set = pd.read_csv("./mp_test_1.csv", header='infer')  # read the experimental data file
test_mp = test_set["TARGET"].to_numpy()  # extract the experimental melting points

# Range covering both predicted and experimental values, used for the axis limits
xmin = min(predict.min(), test_mp.min())
xmax = max(predict.max(), test_mp.max())
ymin = xmin
ymax = xmax
[ ]
# Set the figure size
plt.figure(figsize=(7, 6))

# Set the x-axis and y-axis ranges
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)

# Add the x-axis and y-axis labels
plt.xlabel('Predicted Melting Point (℃)', fontsize=14)
plt.ylabel('Experimental Melting Point (℃)', fontsize=14)

# Add the title
plt.title('Experimental vs Predicted Melting Point', fontsize=16)

# Draw the scatter plot
plt.scatter(predict, test_mp, color='blue', alpha=0.6)

# Draw the y = x reference line
x = np.linspace(xmin, xmax)
plt.plot(x, x, color='red', linestyle='--', linewidth=2)

# Show the figure
plt.show()
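A scatter plot gives a visual impression; a few summary numbers make two runs easier to compare. The sketch below computes R², MAE, and RMSE with NumPy. The helper `regression_metrics` is an assumption introduced for this example, and the toy arrays are made up; in this notebook you would pass `test_mp` and `predict` instead.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error, and root-mean-square error as plain floats."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "mae": float(np.mean(np.abs(residuals))),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
    }


# Toy numbers for illustration only; use regression_metrics(test_mp, predict) in the notebook.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0])
print(metrics)
```

MAE and RMSE are expressed in the same unit as the target (℃ here), which makes them easier to interpret than R² alone.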

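Exercise

Try running the workflow once on a 10% sample and once on the full training and test sets, then compare the prediction quality of the two runs.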
