RDKit Input for Building AI Models - Taking Uni-Mol as an Example
©️ Copyright 2023 @ Authors
Author:
He Yang 📨
Date: 2023-06-13
License Agreement: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the bohrium-notebook:05-31 image and any node configuration, and wait a moment to run.
For artificial intelligence models, an accurate dataset is essential for model training. As a mainstream open-source software in the field of cheminformatics, RDKit is used by many researchers in the construction of datasets.
Uni-Mol is a 3D molecular pre-training model launched by DP Technology. Uni-Mol directly takes the 3D structure of molecules as model input, rather than using 1D sequences or 2D graph structures. The representation learning based on 3D information allows Uni-Mol to surpass SOTA (state of the art) in almost all downstream tasks related to drug molecules and protein pockets. It also enables Uni-Mol to directly accomplish tasks related to 3D conformation generation, such as molecular conformation generation and protein-ligand binding conformation prediction, surpassing existing solutions. here is the Uni-Mol GitHub link.
Next, we will take Uni-Mol as an example to learn about the role of RDKit in dataset construction by browsing through some of its code.
If we print out the inputs of the model, we will find that these inputs are vectors of a constant length.
The length of the vectors represents the number of features chosen by researchers when constructing the dataset.
In Uni-Mol, nine types of features are used to construct atomic feature vectors, which are:
- Atomic number,
- Atomic chirality,
- Number of atomic bonds,
- Atomic charge,
- Number of neighboring hydrogen atoms,
- Number of non-bonding electrons,
- Hybridization state,
- Aromaticity,
- Whether it is part of a ring.
The input vector length of Uni-Mol chemical bond nodes is 3, which means 3 types of features are selected:
- Bond type,
- Bond stereochemistry,
- Whether it is a conjugated bond.
We store these features and their corresponding values in a dictionary called allowable_feature
for easy management.
Once the feature types are determined, we can use RDKit to traverse each atom in the compound and then calculate the value of each atom relative to each feature. This way, we construct the atom-level input vector for the compound.
Similarly, using RDKit to traverse each chemical bond in the compound, calculate the value of the chemical bond relative to each feature, thus constructing the input vector of the compound at the chemical bond level.
atom_to_feature_vector() and bond_to_feature_vector() are functions that construct input vectors for individual atoms/chemical bonds. In practical applications, our object of operation is an entire molecule.
Therefore, construct a function that takes an RDKit Mol object as input, and the output is the input vectors of all atoms/chemical bonds in the compound.
We take acetylsalicylic acid as an example to construct the input vector. First, use RDKit to read the compound molecule, then add hydrogen atoms, and finally visually check if the molecular structure is correct.
After obtaining the complete structure of acetylsalicylic acid, use get_graph()
to generate the corresponding input vectors.
乙酰水杨酸的原子节点的输入向量: [[5 0 3 5 1 0 1 1 1] [5 0 3 5 1 0 1 1 1] [5 0 3 5 1 0 1 1 1] [5 0 3 5 1 0 1 1 1] [5 0 3 5 0 0 1 1 1] [5 0 3 5 0 0 1 0 0] [7 0 1 5 0 0 1 0 0] [7 0 2 5 1 0 1 0 0] [5 0 3 5 0 0 1 1 1] [7 0 2 5 0 0 1 0 0] [5 0 3 5 0 0 1 0 0] [7 0 1 5 0 0 1 0 0] [5 0 4 5 3 0 2 0 0]] 化学键的原子序号: [[ 0 1 1 2 2 3 3 4 4 5 5 6 5 7 4 8 8 9 9 10 10 11 10 12 8 0] [ 1 0 2 1 3 2 4 3 5 4 6 5 7 5 8 4 9 8 10 9 11 10 12 10 0 8]] 化学键节点的输入向量: [[3 0 1] [3 0 1] [3 0 1] [3 0 1] [3 0 1] [3 0 1] [3 0 1] [3 0 1] [0 0 1] [0 0 1] [1 0 1] [1 0 1] [0 0 1] [0 0 1] [3 0 1] [3 0 1] [0 0 1] [0 0 1] [0 0 1] [0 0 1] [1 0 1] [1 0 1] [0 0 0] [0 0 0] [3 0 1] [3 0 1]]