Uni-Mol Property Prediction in Practice: Regression Task, Predicting the Melting Points of Organic/Electrolyte Molecules

陈乐天 Letian Chen

zengboshen@dp.tech

wanghongshuai@dp.tech

Updated 2025-02-28

Recommended image: unimol-tools:0.0.1

Recommended machine type: c3_m4_1 * NVIDIA T4

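©️ Copyright 2025 @ Authors
Authors: 曾博深, 汪鸿帅, 陈乐天
Date: 2025-02-28
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: Click the 开始连接 (Connect) button above, select the unimol-tools:0.0.1 image and any GPU node configuration, and after a short wait the notebook is ready to run.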

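Background

The melting point is the temperature at which a substance changes from solid to liquid. At constant pressure, the temperature of a solid that is being heated stops rising once melting begins, and only continues to rise after all of the solid has turned into liquid.
In batteries, the melting point of an electrolyte molecule is an important physical quantity for assessing its stability and usable temperature range. A good electrolyte material needs a wide liquid range, and different application scenarios call for electrolytes with suitable melting points to meet their specific performance requirements.
Predicting the melting points of molecules that have not been measured helps us screen the possible chemical space for materials that can serve as electrolytes.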

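Step0: Install Uni-Mol Tools

Use pip install to set up the environment and install unimol_tools.

If you use the default image for this case, no installation or environment setup is needed.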
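Before running any install command, it can help to check whether the packages are already importable in the current environment. The following is a minimal sketch using only the Python standard library; the helper name is_installed is hypothetical and is not part of unimol_tools.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Top-level module names only; dotted names would need their parent installed
    for name in ("unimol_tools", "rdkit", "torch"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If a package is reported as missing, uncomment the matching pip line in the cells of this step.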

import os
# Route HTTP and HTTPS traffic through the proxy at ga.dp.tech:8118
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'

# !pip install --upgrade pip
# !pip install torch joblib rdkit pyyaml addict tqdm matplotlib huggingface_hub seaborn
# !pip install numpy==1.22.4 pandas==1.4.0 scikit-learn==1.5.0

# !pip install unimol_tools

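Step1: Load the Data

The dataset contains the SMILES strings and measured melting points (TARGET) of nearly 20,000 molecules.
TARGET is a continuous value in degrees Celsius.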

!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
!wget -P ./ https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv

--2025-02-28 16:44:34--  https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_test.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.7, 10.255.254.37, 10.255.254.18
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.7|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 69686 (68K) [text/csv]
Saving to: ‘./mp_test.csv.3’

mp_test.csv.3       100%[===================>]  68.05K  --.-KB/s    in 0.009s  

2025-02-28 16:44:35 (7.24 MB/s) - ‘./mp_test.csv.3’ saved [69686/69686]

--2025-02-28 16:44:35--  https://dp-public.oss-cn-beijing.aliyuncs.com/community/mp_train.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 626850 (612K) [text/csv]
Saving to: ‘./mp_train.csv.3’

mp_train.csv.3      100%[===================>] 612.16K  --.-KB/s    in 0.05s   

2025-02-28 16:44:36 (10.9 MB/s) - ‘./mp_train.csv.3’ saved [626850/626850]
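Once the files are downloaded, a quick sanity check of the table can catch missing values or a non-numeric TARGET column before training. The following is a minimal sketch with pandas; the helper check_dataset is hypothetical, and the inline rows are a few examples copied from the training data shown in this tutorial so that the snippet runs without the downloaded file.

```python
import io

import pandas as pd

# A few rows copied from the training data shown in this tutorial
sample_csv = io.StringIO(
    "SMILES,TARGET\n"
    "O=S(=O)(Cl)c1ccc(Cl)cc1F,36.00\n"
    "CN(C)c1cccc2c(S(N)(=O)=O)cccc12,219.75\n"
    "CCCCCC1=CCCC1,-83.00\n"
)


def check_dataset(df: pd.DataFrame) -> dict:
    """Summarize basic sanity checks for a SMILES/TARGET table."""
    return {
        "rows": len(df),
        "missing_smiles": int(df["SMILES"].isna().sum()),
        "missing_target": int(df["TARGET"].isna().sum()),
        "target_is_numeric": bool(pd.api.types.is_numeric_dtype(df["TARGET"])),
        "target_min": float(df["TARGET"].min()),
        "target_max": float(df["TARGET"].max()),
    }


# Replace sample_csv with './mp_train.csv' to check the real file
df = pd.read_csv(sample_csv)
report = check_dataset(df)
print(report)
```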

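Step2: Sample the Data

The melting point dataset is fairly large, so a full training run is hard to finish within a short demo. To shorten it, you can randomly sample 10% of the training set and 10% of the test set by enabling the commented-out sample(...) lines below with frac=0.1.
As written, the code below uses all of the data, which takes longer but gives better predictions.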
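As a small illustration of what random sampling does, here is a minimal sketch on a toy table. The toy SMILES strings and targets are made up for the example and are not taken from the dataset; DataFrame.sample with frac and random_state is standard pandas.

```python
import pandas as pd

# Toy stand-in for the training table (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "SMILES": ["C" * (i + 1) + "O" for i in range(100)],
    "TARGET": [float(i) for i in range(100)],
})

# frac=0.1 keeps 10% of the rows; random_state makes the draw reproducible
subset = toy.sample(frac=0.1, random_state=1)
repeat = toy.sample(frac=0.1, random_state=1)

print(len(toy), "->", len(subset))
print(subset.index.equals(repeat.index))  # same seed, same rows
```

Fixing random_state means a rerun of the notebook trains on the same subset, which keeps results comparable between runs.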

import pandas as pd

train_data_total = pd.read_csv('./mp_train.csv')
# train_data = train_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for training
train_data = train_data_total  # use all of the data for training
print("------------ Sampled Train Data ------------")  # show the training data
print(train_data)
train_data.columns = ["SMILES", "TARGET"]
train_data.to_csv('./mp_train_1.csv')  # save the (optionally sampled) training set
print('\n')

test_data_total = pd.read_csv('./mp_test.csv')
# test_data = test_data_total.sample(frac=0.1, random_state=1)  # randomly sample 10% for testing
test_data = test_data_total  # use all of the data for testing
print("------------ Sampled Test Data ------------")
print(test_data)
test_data.columns = ["SMILES", "TARGET"]
test_data.to_csv('./mp_test_1.csv')  # save the (optionally sampled) test set

------------ Sampled Train Data ------------
                                              SMILES  TARGET
0                           O=S(=O)(Cl)c1ccc(Cl)cc1F   36.00
1                    CN(C)c1cccc2c(S(N)(=O)=O)cccc12  219.75
2              CC(C)(C)C(O)C(Cc1ccc(Cl)cc1Cl)n1cncn1  148.00
3                           BrC(=NNc1ccccc1)c1ccccc1  190.00
4                           O=C(C(=O)c1cccs1)c1cccs1   83.00
...                                              ...     ...
17610  O=C(O)[C@@H]1[C@@H]2C=C[C@@H](C2)[C@H]1C(=O)O  185.00
17611                                  CCCCCC1=CCCC1  -83.00
17612         Clc1nccn1C(c1ccccc1)(c1ccccc1)c1ccccc1  201.00
17613                               O=C1COc2ccccc2N1  174.00
17614                       CCCCOC(=O)CCCCC(=O)OCCCC  -32.40

[17615 rows x 2 columns]


------------ Sampled Test Data ------------
                                      SMILES      TARGET
0     BrC(CCC(Br)C(Br)c1ccccc1)C(Br)c1ccccc1  194.000000
1                      BrC(c1ccccc1)c1ccccc1   41.666667
2                     BrC12CC3CC(CC(C3)C1)C2  119.000000
3                                      BrCBr  -52.625000
4                            BrCC(Br)(Br)CBr   10.750000
...                                      ...         ...
1952                           c1ccc2nonc2c1   54.000000
1953                           c1ccc2nsnc2c1   43.000000
1954                                c1cncnc1   21.333333
1955                       c1coc(C2CNCCN2)c1   87.000000
1956                              c1nc[nH]n1  120.000000

[1957 rows x 2 columns]

Step3: Visualize the Dataset Distribution

import matplotlib.pyplot as plt

bins = 30  # number of histogram bins
plt.figure(figsize=(6, 5))
plt.hist(train_data["TARGET"], bins=bins, label="Train Data")
plt.hist(test_data["TARGET"], bins=bins, label="Test Data")
plt.ylabel("Count")
plt.xlabel("Melting Point (℃)")
plt.title("Distribution")
plt.legend(prop={'size': 12})
plt.tick_params(labelsize=14)
plt.tight_layout()
plt.savefig('./dataset_distribution_histogram.png', format='png')
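When two histograms are overlaid, computing one set of bin edges from both datasets keeps the bars directly comparable. The following is a minimal sketch with NumPy; the arrays are synthetic stand-ins for the two TARGET columns, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two TARGET columns (illustrative values only)
train_target = rng.normal(loc=100.0, scale=80.0, size=1000)
test_target = rng.normal(loc=90.0, scale=70.0, size=100)

bins = 30
# One set of edges spanning both datasets
combined = np.concatenate([train_target, test_target])
edges = np.histogram_bin_edges(combined, bins=bins)

# Count each dataset against the same edges
train_counts, _ = np.histogram(train_target, bins=edges)
test_counts, _ = np.histogram(test_target, bins=edges)

print(len(edges), train_counts.sum(), test_counts.sum())
```

The edges array can be passed to plt.hist through its bins argument in place of an integer.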

Step4: Train the Model

Use Uni-Mol Tools to train a model on the data.

from unimol_tools import MolTrain, MolPredict
import numpy as np

clf = MolTrain(task='regression',  # regression task
               data_type='molecule',  # data type: molecule
               epochs=20,  # number of epochs, i.e. passes over the entire training set;
                           # in each epoch the model updates its parameters to reduce the prediction error
               learning_rate=0.0001,
               batch_size=16,
               early_stopping=5,
               metrics='r2',
               split='random',
               # pretrained weights path; the log below shows which weights file was loaded
               weight_path='/opt/mamba/lib/python3.10/site-packages/unimol_tools/weights//opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/model_4.pth',
               save_path='./mp_train',  # where the trained model is saved
               )
clf.fit('./mp_train_1.csv')  # training set file
clf = MolPredict(load_model='./mp_train')  # load the trained model

2025-02-28 16:51:39 | unimol_tools/data/datareader.py | 193 | INFO | Uni-Mol Tools | Anomaly clean with 3 sigma threshold: 17615 -> 17585
2025-02-28 16:51:46 | unimol_tools/data/conformer.py | 126 | INFO | Uni-Mol Tools | Start generating conformers...
17585it [02:15, 130.00it/s]
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 135 | INFO | Uni-Mol Tools | Succeeded in generating conformers for 100.00% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 142 | INFO | Uni-Mol Tools | Succeeded in generating 3d conformers for 99.78% of molecules.
2025-02-28 16:54:02 | unimol_tools/data/conformer.py | 145 | INFO | Uni-Mol Tools | Failed 3d conformers indices: [41, 273, 437, 805, 892, 918, 1210, 2435, 2649, 3081, 3501, 4133, 5075, 5296, 5605, 5799, 6575, 6601, 6639, 7738, 7959, 8644, 9528, 10250, 10593, 10836, 11039, 11041, 12041, 12313, 12705, 13519, 14296, 14901, 15596, 15933, 16996, 17262]
2025-02-28 16:54:02 | unimol_tools/data/datahub.py | 112 | INFO | Uni-Mol Tools | Split method: random, fold: 5
2025-02-28 16:54:02 | unimol_tools/train.py | 202 | INFO | Uni-Mol Tools | Output directory already exists: ./mp_train
2025-02-28 16:54:02 | unimol_tools/train.py | 203 | INFO | Uni-Mol Tools | Warning: Overwrite output directory: ./mp_train
2025-02-28 16:54:02 | unimol_tools/models/unimol.py | 120 | INFO | Uni-Mol Tools | Loading pretrained weights from /opt/mamba/lib/python3.10/site-packages/unimol_tools/weights/mol_pre_all_h_220816.pt
2025-02-28 16:54:03 | unimol_tools/models/nnmodel.py | 144 | INFO | Uni-Mol Tools | start training Uni-Mol:unimolv1
2025-02-28 16:54:55 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [1/20] train_loss: 0.4635, val_loss: 0.4275, val_r2: 0.5825, lr: 0.000098, 52.0s
2025-02-28 16:55:47 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [2/20] train_loss: 0.2808, val_loss: 0.2189, val_r2: 0.7861, lr: 0.000093, 51.9s
2025-02-28 16:57:32 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [4/20] train_loss: 0.1856, val_loss: 0.1807, val_r2: 0.8234, lr: 0.000082, 52.2s
2025-02-28 16:58:25 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [5/20] train_loss: 0.1532, val_loss: 0.1948, val_r2: 0.8096, lr: 0.000077, 52.5s
2025-02-28 17:01:04 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [8/20] train_loss: 0.0823, val_loss: 0.1748, val_r2: 0.8293, lr: 0.000062, 53.2s
2025-02-28 17:01:59 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [9/20] train_loss: 0.0684, val_loss: 0.1888, val_r2: 0.8155, lr: 0.000057, 55.4s
2025-02-28 17:02:54 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [10/20] train_loss: 0.0548, val_loss: 0.2044, val_r2: 0.8003, lr: 0.000052, 54.3s
2025-02-28 17:03:48 | unimol_tools/tasks/trainer.py | 208 | INFO | Uni-Mol Tools | Epoch [11/20] train_loss: 0.0453, val_loss: 0.1767, val_r2: 0.8274, lr: 0.000046, 53.9s
Train:  34%|███▎      | 296/879 [00:16<00:32, 18.07it/s, Epoch=Epoch 12/20, loss=0.0364, lr=0.0000]
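The first log line reports an anomaly clean with a 3 sigma threshold that reduces the data from 17615 to 17585 rows. The following is a minimal sketch of what a 3-sigma filter on the target column looks like in pandas. It is an illustration under the assumption of a simple mean ± 3 standard deviations rule, not the library's own code; the function name three_sigma_filter and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def three_sigma_filter(df: pd.DataFrame, column: str = "TARGET") -> pd.DataFrame:
    """Keep rows whose value lies within mean ± 3 standard deviations."""
    mean = df[column].mean()
    std = df[column].std()
    mask = (df[column] > mean - 3 * std) & (df[column] < mean + 3 * std)
    return df[mask]


rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=50.0, size=1000)  # synthetic melting points
values[:3] = [5000.0, -4000.0, 6000.0]                 # three obvious outliers
toy = pd.DataFrame({"TARGET": values})

cleaned = three_sigma_filter(toy)
print(len(toy), "->", len(cleaned))
```

Note that extreme values inflate the standard deviation itself, so a single pass of this rule only removes points that are far outside the bulk of the data.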

Step5: Predict Melting Points

# Visualize the training result by plotting the experimental values against the predicted values on the test set
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

clf = MolPredict(load_model='./mp_train')  # load the trained model
predict = clf.predict('./mp_test_1.csv').reshape(-1)
test_set = pd.read_csv("./mp_test_1.csv", header='infer')  # read the experimental data file
test_mp = test_set["TARGET"].to_numpy()  # extract the melting point values

# Compute the range of the predicted and experimental values to set the axis limits
xmin = min(predict.min(), test_mp.min())
xmax = max(predict.max(), test_mp.max())
ymin = xmin
ymax = xmax

# Set the figure size
plt.figure(figsize=(7, 6))
# Set the x-axis and y-axis ranges
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
# Add the x-axis and y-axis labels
plt.xlabel('Predicted Melting Point', fontsize=14)
plt.ylabel('Experimental Melting Point', fontsize=14)
# Add the title
plt.title('Experimental vs Predicted Melting Point', fontsize=16)
# Draw the scatter plot
plt.scatter(predict, test_mp, color='blue', alpha=0.6)
# Draw the y = x line
x = np.linspace(xmin, xmax)
plt.plot(x, x, color='red', linestyle='--', linewidth=2)
# Show the figure
plt.show()
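A scatter plot shows the trend, but a few numbers make two runs easier to compare. The following is a minimal sketch that computes MAE, RMSE and R² with NumPy. The helper regression_metrics is hypothetical, and the example arrays are illustrative numbers, not model output; in the notebook, test_mp and predict would be passed instead.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mae": float(np.mean(np.abs(residual))),
        "rmse": float(np.sqrt(np.mean(residual ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Illustrative numbers only; in the notebook, pass test_mp and predict instead
y_true = np.array([36.0, 219.75, 148.0, 190.0, 83.0])
y_pred = np.array([40.0, 210.0, 150.0, 180.0, 90.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the same unit as TARGET (degrees Celsius), while R² is dimensionless.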

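Exercise

Try running the workflow once with a 10% sample and once with the entire training and test sets, then compare the prediction quality of the two runs.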
