Bohrium
robot
新建

空间站广场

论文
Notebooks
比赛
课程
Apps
我的主页
我的Notebooks
我的论文库
我的足迹

我的工作空间

任务
节点
文件
数据集
镜像
项目
数据库
公开
RDKit Input for Building AI Models - Taking Uni-Mol as an Example
RDKit
RDKit
Yani Guan
更新于 2024-10-24
推荐镜像 :Basic Image:bohrium-notebook:2023-04-07
推荐机型 :c2_m4_cpu
AI4SCUP-CNS-BBB(v1)

RDKit Input for Building AI Models - Taking Uni-Mol as an Example

©️ Copyright 2023 @ Authors
Author: He Yang 📨
Date: 2023-06-13
License Agreement: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the bohrium-notebook:05-31 image and any node configuration, and wait a moment to run.

代码
文本

For artificial intelligence models, an accurate dataset is essential for model training. As a mainstream open-source software in the field of cheminformatics, RDKit is used by many researchers in the construction of datasets.

Uni-Mol is a 3D molecular pre-training model launched by DP Technology. Uni-Mol directly takes the 3D structure of molecules as model input, rather than using 1D sequences or 2D graph structures. The representation learning based on 3D information allows Uni-Mol to surpass SOTA (state of the art) in almost all downstream tasks related to drug molecules and protein pockets. It also enables Uni-Mol to directly accomplish tasks related to 3D conformation generation, such as molecular conformation generation and protein-ligand binding conformation prediction, surpassing existing solutions. here is the Uni-Mol GitHub link.

Next, we will take Uni-Mol as an example to learn about the role of RDKit in dataset construction by browsing through some of its code.

代码
文本
[4]
from rdkit import Chem
import numpy as np
代码
文本
[5]
# Defines a function that searches for and returns the index of an element in a list. If the element is not found in the list, it returns the last index value of the list.
def safe_index(l, e):
"""
Return index of element e in list l. If e is not present, return the last index
"""
try:
return l.index(e)
except:
return len(l) - 1
代码
文本

If we print out the inputs of the model, we will find that these inputs are vectors of a constant length.
The length of the vectors represents the number of features chosen by researchers when constructing the dataset.
In Uni-Mol, nine types of features are used to construct atomic feature vectors, which are:

  • Atomic number,
  • Atomic chirality,
  • Number of atomic bonds,
  • Atomic charge,
  • Number of neighboring hydrogen atoms,
  • Number of non-bonding electrons,
  • Hybridization state,
  • Aromaticity,
  • Whether it is part of a ring.

The input vector length of Uni-Mol chemical bond nodes is 3, which means 3 types of features are selected:

  • Bond type,
  • Bond stereochemistry,
  • Whether it is a conjugated bond.

We store these features and their corresponding values in a dictionary called allowable_feature for easy management.

代码
文本
[6]
allowable_features = {
"possible_atomic_num_list": list(range(1, 119)) + ["misc"],
"possible_chirality_list": [
"CHI_UNSPECIFIED",
"CHI_TETRAHEDRAL_CW",
"CHI_TETRAHEDRAL_CCW",
"CHI_TRIGONALBIPYRAMIDAL",
"CHI_OCTAHEDRAL",
"CHI_SQUAREPLANAR",
"CHI_OTHER",
],
"possible_degree_list": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, "misc"],
"possible_formal_charge_list": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, "misc"],
"possible_numH_list": [0, 1, 2, 3, 4, 5, 6, 7, 8, "misc"],
"possible_number_radical_e_list": [0, 1, 2, 3, 4, "misc"],
"possible_hybridization_list": ["SP", "SP2", "SP3", "SP3D", "SP3D2", "misc"],
"possible_is_aromatic_list": [False, True],
"possible_is_in_ring_list": [False, True],
"possible_bond_type_list": ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC", "misc"],
"possible_bond_stereo_list": [
"STEREONONE",
"STEREOZ",
"STEREOE",
"STEREOCIS",
"STEREOTRANS",
"STEREOANY",
],
"possible_is_conjugated_list": [False, True],
}
代码
文本

Once the feature types are determined, we can use RDKit to traverse each atom in the compound and then calculate the value of each atom relative to each feature. This way, we construct the atom-level input vector for the compound.

代码
文本
[7]
def atom_to_feature_vector(atom):
"""
Converts rdkit atom object to feature list of indices
:param mol: rdkit atom object
:return: list
"""
atom_feature = [
safe_index(allowable_features["possible_atomic_num_list"], atom.GetAtomicNum()),
allowable_features["possible_chirality_list"].index(str(atom.GetChiralTag())),
safe_index(allowable_features["possible_degree_list"], atom.GetTotalDegree()),
safe_index(
allowable_features["possible_formal_charge_list"], atom.GetFormalCharge()
),
safe_index(allowable_features["possible_numH_list"], atom.GetTotalNumHs()),
safe_index(
allowable_features["possible_number_radical_e_list"],
atom.GetNumRadicalElectrons(),
),
safe_index(
allowable_features["possible_hybridization_list"],
str(atom.GetHybridization()),
),
allowable_features["possible_is_aromatic_list"].index(atom.GetIsAromatic()),
allowable_features["possible_is_in_ring_list"].index(atom.IsInRing()),
]
return atom_feature
代码
文本

Similarly, using RDKit to traverse each chemical bond in the compound, calculate the value of the chemical bond relative to each feature, thus constructing the input vector of the compound at the chemical bond level.

代码
文本
[8]
def bond_to_feature_vector(bond):
"""
Converts rdkit bond object to feature list of indices
:param mol: rdkit bond object
:return: list
"""
bond_feature = [
safe_index(
allowable_features["possible_bond_type_list"], str(bond.GetBondType())
),
allowable_features["possible_bond_stereo_list"].index(str(bond.GetStereo())),
allowable_features["possible_is_conjugated_list"].index(bond.GetIsConjugated()),
]
return bond_feature
代码
文本

atom_to_feature_vector() and bond_to_feature_vector() are functions that construct input vectors for individual atoms/chemical bonds. In practical applications, our object of operation is an entire molecule.

Therefore, construct a function that takes an RDKit Mol object as input, and the output is the input vectors of all atoms/chemical bonds in the compound.

代码
文本
[9]
def get_graph(mol):
"""
Converts SMILES string to graph Data object
:input: SMILES string (str)
:return: graph object
"""
atom_features_list = []
for atom in mol.GetAtoms():
atom_features_list.append(atom_to_feature_vector(atom))
x = np.array(atom_features_list, dtype=np.int32)
# bonds
num_bond_features = 3 # bond type, bond stereo, is_conjugated
if len(mol.GetBonds()) > 0: # mol has bonds
edges_list = []
edge_features_list = []
for bond in mol.GetBonds():
i = bond.GetBeginAtomIdx()
j = bond.GetEndAtomIdx()
edge_feature = bond_to_feature_vector(bond)
# add edges in both directions
edges_list.append((i, j))
edge_features_list.append(edge_feature)
edges_list.append((j, i))
edge_features_list.append(edge_feature)
# data.edge_index: Graph connectivity in COO format with shape [2, num_edges]
edge_index = np.array(edges_list, dtype=np.int32).T
# data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]
edge_attr = np.array(edge_features_list, dtype=np.int32)

else: # mol has no bonds
edge_index = np.empty((2, 0), dtype=np.int32)
edge_attr = np.empty((0, num_bond_features), dtype=np.int32)
return x, edge_index, edge_attr
代码
文本

We take acetylsalicylic acid as an example to construct the input vector. First, use RDKit to read the compound molecule, then add hydrogen atoms, and finally visually check if the molecular structure is correct.

代码
文本
[10]
from rdkit import Chem

m = Chem.MolFromSmiles("c1cccc(C(=O)O)c1OC(=O)C")
Chem.AddHs(m)
Chem.Draw.MolToImage(m)
代码
文本

After obtaining the complete structure of acetylsalicylic acid, use get_graph() to generate the corresponding input vectors.

代码
文本
[12]
x, edge_index, edge_attr = get_graph(m)
print("Input vectors of the atomic nodes for acetylsalicylic acid:\n%s\n\n Atomic indices of the chemical bonds:\n%s\n\n Input vectors of the chemical bond nodes:\n%s" % (x, edge_index, edge_attr))
乙酰水杨酸的原子节点的输入向量:
[[5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 0 0 1 1 1]
 [5 0 3 5 0 0 1 0 0]
 [7 0 1 5 0 0 1 0 0]
 [7 0 2 5 1 0 1 0 0]
 [5 0 3 5 0 0 1 1 1]
 [7 0 2 5 0 0 1 0 0]
 [5 0 3 5 0 0 1 0 0]
 [7 0 1 5 0 0 1 0 0]
 [5 0 4 5 3 0 2 0 0]]

 化学键的原子序号:
[[ 0  1  1  2  2  3  3  4  4  5  5  6  5  7  4  8  8  9  9 10 10 11 10 12
   8  0]
 [ 1  0  2  1  3  2  4  3  5  4  6  5  7  5  8  4  9  8 10  9 11 10 12 10
   0  8]]

 化学键节点的输入向量:
[[3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [0 0 1]
 [0 0 1]
 [1 0 1]
 [1 0 1]
 [0 0 1]
 [0 0 1]
 [3 0 1]
 [3 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [1 0 1]
 [1 0 1]
 [0 0 0]
 [0 0 0]
 [3 0 1]
 [3 0 1]]
代码
文本
RDKit
RDKit
点个赞吧
推荐阅读
公开
RDKit | 构建AI模型的输入——以Uni-Mol为例
RDKit化学信息学
RDKit化学信息学
YangHe
发布于 2023-06-13
3 赞13 转存文件
公开
8. 快速查看整体信息
Pandas
Pandas
xuxh@dp.tech
更新于 2024-08-06
{/**/}