空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的论文库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

RDKit Input for Building AI Models - Taking Uni-Mol as an Example

RDKit

Yani Guan

更新于 2024-10-24

推荐镜像 :Basic Image:bohrium-notebook:2023-04-07

推荐机型 :c2_m4_cpu

数据集

AI4SCUP-CNS-BBB(v1)

RDKit Input for Building AI Models - Taking Uni-Mol as an Example

©️ Copyright 2023 @ Authors
Author: He Yang 📨
Date: 2023-06-13
License Agreement: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the bohrium-notebook:05-31 image and any node configuration, and wait a moment to run.

代码

文本

For artificial intelligence models, an accurate dataset is essential for model training. As a mainstream open-source software in the field of cheminformatics, RDKit is used by many researchers in the construction of datasets.

Uni-Mol is a 3D molecular pre-training model launched by DP Technology. Uni-Mol directly takes the 3D structure of molecules as model input, rather than using 1D sequences or 2D graph structures. The representation learning based on 3D information allows Uni-Mol to surpass SOTA (state of the art) in almost all downstream tasks related to drug molecules and protein pockets. It also enables Uni-Mol to directly accomplish tasks related to 3D conformation generation, such as molecular conformation generation and protein-ligand binding conformation prediction, surpassing existing solutions. here is the Uni-Mol GitHub link.

Next, we will take Uni-Mol as an example to learn about the role of RDKit in dataset construction by browsing through some of its code.

代码

文本

[4]

from rdkit import Chem

import numpy as np

代码

文本

[5]

# Defines a function that searches for and returns the index of an element in a list. If the element is not found in the list, it returns the last index value of the list.

def safe_index(l, e):

"""

Return index of element e in list l. If e is not present, return the last index

"""

try:

return l.index(e)

except:

return len(l) - 1

代码

文本

If we print out the inputs of the model, we will find that these inputs are vectors of a constant length.
The length of the vectors represents the number of features chosen by researchers when constructing the dataset.
In Uni-Mol, nine types of features are used to construct atomic feature vectors, which are:

Atomic number,
Atomic chirality,
Number of atomic bonds,
Atomic charge,
Number of neighboring hydrogen atoms,
Number of non-bonding electrons,
Hybridization state,
Aromaticity,
Whether it is part of a ring.

The input vector length of Uni-Mol chemical bond nodes is 3, which means 3 types of features are selected:

Bond type,
Bond stereochemistry,
Whether it is a conjugated bond.

We store these features and their corresponding values in a dictionary called allowable_feature for easy management.

代码

文本

[6]

allowable_features = {

"possible_atomic_num_list": list(range(1, 119)) + ["misc"],

"possible_chirality_list": [

"CHI_UNSPECIFIED",

"CHI_TETRAHEDRAL_CW",

"CHI_TETRAHEDRAL_CCW",

"CHI_TRIGONALBIPYRAMIDAL",

"CHI_OCTAHEDRAL",

"CHI_SQUAREPLANAR",

"CHI_OTHER",

"possible_degree_list": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, "misc"],

"possible_formal_charge_list": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, "misc"],

"possible_numH_list": [0, 1, 2, 3, 4, 5, 6, 7, 8, "misc"],

"possible_number_radical_e_list": [0, 1, 2, 3, 4, "misc"],

"possible_hybridization_list": ["SP", "SP2", "SP3", "SP3D", "SP3D2", "misc"],

"possible_is_aromatic_list": [False, True],

"possible_is_in_ring_list": [False, True],

"possible_bond_type_list": ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC", "misc"],

"possible_bond_stereo_list": [

"STEREONONE",

"STEREOZ",

"STEREOE",

"STEREOCIS",

"STEREOTRANS",

"STEREOANY",

"possible_is_conjugated_list": [False, True],

}

代码

文本

Once the feature types are determined, we can use RDKit to traverse each atom in the compound and then calculate the value of each atom relative to each feature. This way, we construct the atom-level input vector for the compound.

代码

文本

[7]

def atom_to_feature_vector(atom):

"""

Converts rdkit atom object to feature list of indices

:param mol: rdkit atom object

:return: list

"""

atom_feature = [

safe_index(allowable_features["possible_atomic_num_list"], atom.GetAtomicNum()),

allowable_features["possible_chirality_list"].index(str(atom.GetChiralTag())),

safe_index(allowable_features["possible_degree_list"], atom.GetTotalDegree()),

safe_index(

allowable_features["possible_formal_charge_list"], atom.GetFormalCharge()

safe_index(allowable_features["possible_numH_list"], atom.GetTotalNumHs()),

safe_index(

allowable_features["possible_number_radical_e_list"],

atom.GetNumRadicalElectrons(),

safe_index(

allowable_features["possible_hybridization_list"],

str(atom.GetHybridization()),

allowable_features["possible_is_aromatic_list"].index(atom.GetIsAromatic()),

allowable_features["possible_is_in_ring_list"].index(atom.IsInRing()),

]

return atom_feature

代码

文本

Similarly, using RDKit to traverse each chemical bond in the compound, calculate the value of the chemical bond relative to each feature, thus constructing the input vector of the compound at the chemical bond level.

代码

文本

[8]

def bond_to_feature_vector(bond):

"""

Converts rdkit bond object to feature list of indices

:param mol: rdkit bond object

:return: list

"""

bond_feature = [

safe_index(

allowable_features["possible_bond_type_list"], str(bond.GetBondType())

allowable_features["possible_bond_stereo_list"].index(str(bond.GetStereo())),

allowable_features["possible_is_conjugated_list"].index(bond.GetIsConjugated()),

]

return bond_feature

代码

文本

atom_to_feature_vector() and bond_to_feature_vector() are functions that construct input vectors for individual atoms/chemical bonds. In practical applications, our object of operation is an entire molecule.

Therefore, construct a function that takes an RDKit Mol object as input, and the output is the input vectors of all atoms/chemical bonds in the compound.

代码

文本

[9]

def get_graph(mol):

"""

Converts SMILES string to graph Data object

:input: SMILES string (str)

:return: graph object

"""

atom_features_list = []

for atom in mol.GetAtoms():

atom_features_list.append(atom_to_feature_vector(atom))

x = np.array(atom_features_list, dtype=np.int32)

# bonds

num_bond_features = 3 # bond type, bond stereo, is_conjugated

if len(mol.GetBonds()) > 0: # mol has bonds

edges_list = []

edge_features_list = []

for bond in mol.GetBonds():

i = bond.GetBeginAtomIdx()

j = bond.GetEndAtomIdx()

edge_feature = bond_to_feature_vector(bond)

# add edges in both directions

edges_list.append((i, j))

edge_features_list.append(edge_feature)

edges_list.append((j, i))

edge_features_list.append(edge_feature)

# data.edge_index: Graph connectivity in COO format with shape [2, num_edges]

edge_index = np.array(edges_list, dtype=np.int32).T

# data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]

edge_attr = np.array(edge_features_list, dtype=np.int32)

else: # mol has no bonds

edge_index = np.empty((2, 0), dtype=np.int32)

edge_attr = np.empty((0, num_bond_features), dtype=np.int32)

return x, edge_index, edge_attr

代码

文本

We take acetylsalicylic acid as an example to construct the input vector. First, use RDKit to read the compound molecule, then add hydrogen atoms, and finally visually check if the molecular structure is correct.

代码

文本

[10]

from rdkit import Chem

m = Chem.MolFromSmiles("c1cccc(C(=O)O)c1OC(=O)C")

Chem.AddHs(m)

Chem.Draw.MolToImage(m)

代码

文本

After obtaining the complete structure of acetylsalicylic acid, use get_graph() to generate the corresponding input vectors.

代码

文本

[12]

x, edge_index, edge_attr = get_graph(m)

print("Input vectors of the atomic nodes for acetylsalicylic acid:\n%s\n\n Atomic indices of the chemical bonds:\n%s\n\n Input vectors of the chemical bond nodes:\n%s" % (x, edge_index, edge_attr))

乙酰水杨酸的原子节点的输入向量：
[[5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 1 0 1 1 1]
 [5 0 3 5 0 0 1 1 1]
 [5 0 3 5 0 0 1 0 0]
 [7 0 1 5 0 0 1 0 0]
 [7 0 2 5 1 0 1 0 0]
 [5 0 3 5 0 0 1 1 1]
 [7 0 2 5 0 0 1 0 0]
 [5 0 3 5 0 0 1 0 0]
 [7 0 1 5 0 0 1 0 0]
 [5 0 4 5 3 0 2 0 0]]

 化学键的原子序号：
[[ 0  1  1  2  2  3  3  4  4  5  5  6  5  7  4  8  8  9  9 10 10 11 10 12
   8  0]
 [ 1  0  2  1  3  2  4  3  5  4  6  5  7  5  8  4  9  8 10  9 11 10 12 10
   0  8]]

 化学键节点的输入向量:
[[3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [3 0 1]
 [0 0 1]
 [0 0 1]
 [1 0 1]
 [1 0 1]
 [0 0 1]
 [0 0 1]
 [3 0 1]
 [3 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [1 0 1]
 [1 0 1]
 [0 0 0]
 [0 0 0]
 [3 0 1]
 [3 0 1]]

代码

文本

reference

https://github.com/dptech-corp/Uni-Mol

代码

文本

RDKit

点个赞吧