Molecule- and Atom-Level Vector Representations with Uni-Mol

Zhifeng Gao


Letian

Recommended image: Basic Image: bohrium-notebook:2023-04-07
Recommended machine type: c3_m4_1 * NVIDIA T4
Molecular Vector Representations with Uni-Mol
What Are Molecular/Atomic Vector Representations?
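Molecule- and atom-level vector representation is the process of encoding the chemical and physical properties of atoms or molecules as numerical vectors. It plays an important role in cheminformatics, computational chemistry, drug design, and materials science.
Atom-level representations can include the atom type, electronegativity, atomic radius, electron shell structure, charge distribution, and bond type. For example, a carbon atom might be described by [6, 2.55, 0.77, [2, 4], 0.0, 1].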
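The example vector for carbon contains a nested list ([2, 4]), so it would need to be flattened to a fixed length before a model could consume it. The following is a minimal sketch, assuming the six entries follow the order in which the properties are listed in the paragraph; the `AtomDescriptor` class, its field names, and the padding length are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AtomDescriptor:
    """Hand-crafted atom descriptor (field names are illustrative assumptions)."""
    atomic_number: int
    electronegativity: float
    radius: float
    shell_occupancy: List[int]
    charge: float
    bond_type: int

    def to_vector(self, max_shells: int = 4) -> List[float]:
        # Pad the shell occupancy so every atom yields a vector of the same length.
        shells = (self.shell_occupancy + [0] * max_shells)[:max_shells]
        return [
            float(self.atomic_number),
            self.electronegativity,
            self.radius,
            *[float(n) for n in shells],
            self.charge,
            float(self.bond_type),
        ]


carbon = AtomDescriptor(6, 2.55, 0.77, [2, 4], 0.0, 1)
carbon_vector = carbon.to_vector()
print(carbon_vector)
```

Hand-crafted descriptors like this are fixed by the person who designs them, whereas a pretrained model such as Uni-Mol learns its representation from data.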
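Molecule-level representations cover structural information, electronic properties, geometry, spectral properties, thermodynamic properties, and reactivity. For example, a water molecule might be described by [18.015, 1.85, [0.957, 104.5], [0.34, 0.17, 1.85], -75.0].
Such representations are widely used in machine learning models, molecular similarity search, and chemical reaction prediction. They provide the basis for quantitative analysis and computation in chemistry and biology.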
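To make the similarity-search use case concrete, here is a minimal sketch that ranks a small library of vectors by cosine similarity to a query. The 4-dimensional vectors are toy values invented for illustration, and `cosine_similarity` is a hypothetical helper, not part of any library used in this notebook.

```python
import numpy as np


def cosine_similarity(query: np.ndarray, library: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `library`."""
    query_norm = query / np.linalg.norm(query)
    library_norm = library / np.linalg.norm(library, axis=1, keepdims=True)
    return library_norm @ query_norm


# Toy 4-dimensional "molecule vectors"; real representations would come from a model.
library = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

scores = cosine_similarity(query, library)
ranking = np.argsort(-scores)  # indices from most to least similar
print(ranking, scores[ranking])
```

The same function should apply unchanged to higher-dimensional vectors, since it only relies on row-wise normalization and a matrix-vector product.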
Importing Uni-Mol
Uni-Mol tools can now be installed with pip. Run the cell below to install it in one step.
[1]
!pip install unimol_tools
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting unimol_tools
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/49/02/01b92f2a35425ccfd7675bf3ab6f0a45e6b0e9ff3e95c420ae062801af66/unimol_tools-0.1.0.post4-py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 kB 1.3 MB/s eta 0:00:00
Collecting addict
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/6a/00/b08f23b7d7e1e14ce01419a467b583edbb93c6cdb8654e54a9cc579cd61f/addict-2.4.0-py3-none-any.whl (3.8 kB)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (4.64.1)
Collecting rdkit
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/3d/84/63b2e66f5c7cb97ce994769afbbef85a1ac364fedbcb7d4a3c0f15d318a5/rdkit-2024.3.5-cp38-cp38-manylinux_2_28_x86_64.whl (33.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.1/33.1 MB 2.7 MB/s eta 0:00:00
Requirement already satisfied: torch in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (1.13.1+cu116)
Requirement already satisfied: numpy<2.0.0,>=1.22.4 in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (1.22.4)
Requirement already satisfied: pandas<2.0.0 in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (1.5.3)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (1.0.2)
Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (1.2.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.8/site-packages (from unimol_tools) (6.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.8/site-packages (from pandas<2.0.0->unimol_tools) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.8/site-packages (from pandas<2.0.0->unimol_tools) (2022.7)
Requirement already satisfied: Pillow in /opt/conda/lib/python3.8/site-packages (from rdkit->unimol_tools) (9.4.0)
Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn->unimol_tools) (1.7.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn->unimol_tools) (3.1.0)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch->unimol_tools) (4.5.0)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas<2.0.0->unimol_tools) (1.16.0)
Installing collected packages: addict, rdkit, unimol_tools
Successfully installed addict-2.4.0 rdkit-2024.3.5 unimol_tools-0.1.0.post4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[3]
# Upgrade numpy first to avoid library incompatibility issues
!pip install --upgrade numpy
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (1.24.4)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
The pretrained weights are downloaded from Hugging Face, so configure a proxy first.
[4]
import os
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'
[5]
from unimol_tools import UniMolRepr
import numpy as np
import pandas as pd
# single smiles unimol representation
clf = UniMolRepr(data_type='molecule', remove_hs=False)
smiles = 'c1ccc(cc1)C2=NCC(=O)Nc3c2cc(cc3)[N+](=O)[O]'
smiles_list = [smiles]
unimol_repr = clf.get_repr(smiles_list, return_atomic_reprs=True)
# CLS token repr
print(np.array(unimol_repr['cls_repr']).shape)
# atomic level repr, align with rdkit mol.GetAtoms()
print(np.array(unimol_repr['atomic_reprs']).shape)
/opt/conda/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2024-10-30 16:58:52 | unimol_tools/weights/weighthub.py | 17 | INFO | Uni-Mol Tools | Weights will be downloaded to default directory: /opt/conda/lib/python3.8/site-packages/unimol_tools/weights
2024-10-30 16:58:53 | unimol_tools/weights/weighthub.py | 33 | INFO | Uni-Mol Tools | Downloading mol_pre_all_h_220816.pt
2024-10-30 16:59:15 | unimol_tools/weights/weighthub.py | 33 | INFO | Uni-Mol Tools | Downloading mol.dict.txt
2024-10-30 16:59:17 | unimol_tools/models/unimol.py | 120 | INFO | Uni-Mol Tools | Loading pretrained weights from /opt/conda/lib/python3.8/site-packages/unimol_tools/weights/mol_pre_all_h_220816.pt
2024-10-30 16:59:21 | unimol_tools/data/conformer.py | 89 | INFO | Uni-Mol Tools | Start generating conformers...
1it [00:00, 13.71it/s]
2024-10-30 16:59:21 | unimol_tools/data/conformer.py | 93 | INFO | Uni-Mol Tools | Succeed to generate conformers for 100.00% of molecules.
2024-10-30 16:59:21 | unimol_tools/data/conformer.py | 95 | INFO | Uni-Mol Tools | Succeed to generate 3d conformers for 100.00% of molecules.
100%|██████████| 1/1 [00:01<00:00, 1.18s/it]
(1, 512)
(1, 32, 512)
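The two shapes printed at the end of the output are (1, 512) for the molecule-level CLS representation and (1, 32, 512) for the atom-level representations, that is, one 512-dimensional vector for each of the 32 atoms of this molecule (hydrogens are kept because remove_hs=False). Different molecules have different atom counts, so for several molecules it is safer to treat the atom-level result as a list with one array per molecule rather than as one rectangular array. The sketch below assumes that layout and uses random stand-in arrays instead of real Uni-Mol output; `mean_pool` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for atom-level representations of three molecules with
# 32, 12 and 21 atoms; each atom has a 512-dimensional vector.
atom_counts = [32, 12, 21]
atomic_reprs = [rng.normal(size=(n, 512)) for n in atom_counts]


def mean_pool(per_molecule_atoms):
    """Average the atom vectors of each molecule into one molecule-level vector."""
    return np.stack([atoms.mean(axis=0) for atoms in per_molecule_atoms])


pooled = mean_pool(atomic_reprs)
print(pooled.shape)  # one 512-dimensional row per molecule
```

Mean pooling is only one possible way to summarize atom vectors; the CLS representation already provides a molecule-level vector without any pooling.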
[6]
%%bash
# Download the sample data (CNS drug data)
rm -rf mol_train.csv
wget -nv https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/unimol-qsar/mol_train.csv
2024-10-30 16:59:26 URL:https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/unimol-qsar/mol_train.csv [30600/30600] -> "mol_train.csv" [1]
[7]
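# Read the sample data once and extract the SMILES strings and labels
df = pd.read_csv('mol_train.csv')
smiles_list = df['SMILES'].to_list()
y = df['TARGET'].to_list()
# Molecule-level (CLS) representations, one vector per molecule
repr_dict = clf.get_repr(smiles_list)
unimol_repr_list = np.array(repr_dict['cls_repr'])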
2024-10-30 16:59:31 | unimol_tools/data/conformer.py | 89 | INFO | Uni-Mol Tools | Start generating conformers...
700it [00:39, 17.85it/s]
2024-10-30 17:00:10 | unimol_tools/data/conformer.py | 93 | INFO | Uni-Mol Tools | Succeed to generate conformers for 100.00% of molecules.
2024-10-30 17:00:10 | unimol_tools/data/conformer.py | 95 | INFO | Uni-Mol Tools | Succeed to generate 3d conformers for 100.00% of molecules.
100%|██████████| 22/22 [00:07<00:00, 2.96it/s]
[8]
print(unimol_repr_list.shape)
(700, 512)
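Generating conformers and representations for the 700 molecules took most of a minute in the log above, so it can be worth caching the resulting (700, 512) matrix on disk. This is a minimal sketch using a random stand-in array of the same shape rather than real Uni-Mol output; the file name and the temporary directory are arbitrary choices made for this example.

```python
import os
import tempfile

import numpy as np

# Stand-in with the same shape as `unimol_repr_list`; not real Uni-Mol output.
reprs = np.random.default_rng(0).normal(size=(700, 512)).astype(np.float32)

# The file name is an arbitrary choice for this sketch.
cache_path = os.path.join(tempfile.mkdtemp(), "unimol_cls_repr.npy")
np.save(cache_path, reprs)

# Later sessions can reload the matrix instead of recomputing it.
restored = np.load(cache_path)
print(restored.shape, restored.dtype)
```

In the notebook itself, `unimol_repr_list` would take the place of `reprs`, and a persistent path would replace the temporary directory.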
[9]
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
[10]
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(unimol_repr_list)
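To show what the projection to two components does, and how much of the variance those two components retain, the sketch below carries out PCA by hand with a singular value decomposition. It runs on a random stand-in matrix of the same shape, not on the real representations, so the printed ratios say nothing about Uni-Mol itself.

```python
import numpy as np

# Stand-in for the (700, 512) representation matrix; not real Uni-Mol output.
X = np.random.default_rng(0).normal(size=(700, 512))

# PCA by hand: center the data, then take the top right-singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # coordinates along the first two components

# Fraction of the total variance captured by each of the two components.
explained_variance_ratio = (S[:2] ** 2) / np.sum(S ** 2)
print(X_2d.shape, explained_variance_ratio)
```

With scikit-learn, the same quantity is available as `pca.explained_variance_ratio_` after fitting, although the signs of the projected coordinates may differ from this hand-written version. A low ratio means a 2D scatter plot shows only a small part of the structure in the 512-dimensional space.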
[11]
# Visualization: scatter plot of the two principal components, colored by target
colors = ['r', 'g', 'b']
markers = ['s', 'o', 'D']
labels = ['Target:0', 'Target:1']
y_arr = np.asarray(y)  # array form so that `y_arr == label` yields a boolean mask
plt.figure(figsize=(8, 6))
for label, color, marker in zip(np.unique(y_arr), colors, markers):
    plt.scatter(X_reduced[y_arr == label, 0],
                X_reduced[y_arr == label, 1],
                c=color,
                marker=marker,
                label=labels[label],
                edgecolors='black')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='best')
plt.title('Unimol Repr')
plt.show()