空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的论文库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 (Classification Tasks)

Uni-Mol

Yani Guan

更新于 2024-10-24

推荐镜像 :Uni-Mol:unimol-qsar:v0.5

推荐机型 :c3_m4_1 * NVIDIA T4

Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 & Uni-Mol Introductory Practice. Classification Tasks

Introduction

Let's Prepare Some Data!

A Brief History of QSAR

Basic Requirements for QSAR Modeling

Basic Workflow of QSAR Modeling

Molecular Representation

1D-QSAR Molecular Representation

2D-QSAR Molecular Characterization

3D-QSAR Molecular Characterization

Uni-Mol Molecular Representation Learning and Pretraining Framework

Pretraining Model

Introduction to Uni-Mol

Overview of Results

Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 & Uni-Mol Introductory Practice. Classification Tasks

©️ Copyright 2023 @ Authors
Author: Hang Zheng (郑行) 📨
Date: 2023-06-06
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the unimol-qsar:0703 image and any GPU node configuration, and wait a moment to run.

代码

文本

In recent years, Artificial Intelligence (AI) has been developing at an unprecedented speed, bringing significant breakthroughs and transformations to various fields.

In fact, in the field of drug development, drug scientists have been using a series of mathematical and statistical methods to aid the drug development process since the last century. Based on the structure of drug molecules, they construct mathematical simulations to predict the biochemical activity of drugs. This method is known as Quantitative Structure-Activity Relationship (QSAR). QSAR models have continued to evolve with the deepening research on drug molecules and the introduction of more AI methods.

It can be said that QSAR models are a good microcosm of the development of the AI for Science field. In this Notebook, we will introduce the construction methods of different types of QSAR models in the form of case studies.

代码

文本

Introduction

Quantitative Structure-Activity Relationship (QSAR) is a method that studies the quantitative relationship between the chemical structure of compounds and their biological activity. It is one of the most important tools in Computer-Aided Drug Design (CADD). QSAR aims to establish mathematical models to relate molecular structures with their biochemical and physicochemical properties, helping drug scientists to make rational predictions about the properties of new drug molecules.

Building an effective QSAR model involves several steps:

Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
Using the trained machine learning model to predict the properties of molecules with unknown properties.

The development of QSAR models has evolved with the progression of molecular representation techniques and the corresponding upgrades in machine learning models. In this notebook, we will introduce the construction methods of different types of QSAR models through case studies.

代码

文本

Let's Prepare Some Data!

To better guide everyone through the process of building QSAR models, we will use the BACE-1 target molecular activity prediction task as a demonstration case.

We can start by downloading the BACE-1 dataset:

代码

文本

[10]

import os

os.makedirs("datasets", exist_ok=True)

!wget https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/train.csv -O ./datasets/BACE_train.csv

!wget https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/test.csv -O ./datasets/BACE_test.csv

--2024-10-24 06:23:09--  https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/train.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 81046 (79K) [text/csv]
Saving to: ‘./datasets/BACE_train.csv’

./datasets/BACE_tra 100%[===================>]  79.15K  --.-KB/s    in 0.006s  

2024-10-24 06:23:09 (12.3 MB/s) - ‘./datasets/BACE_train.csv’ saved [81046/81046]

--2024-10-24 06:23:10--  https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/test.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7
Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 22128 (22K) [text/csv]
Saving to: ‘./datasets/BACE_test.csv’

./datasets/BACE_tes 100%[===================>]  21.61K  --.-KB/s    in 0.002s  

2024-10-24 06:23:10 (12.9 MB/s) - ‘./datasets/BACE_test.csv’ saved [22128/22128]

代码

文本

Next, we can take a look at the composition of this dataset:

代码

文本

[11]

import pandas as pd

train_data = pd.read_csv("./datasets/BACE_train.csv")

train_data.columns=["SMILES", "TARGET"]

train_data.to_csv("./datasets/BACE_train.csv", index=False)

test_data = pd.read_csv("./datasets/BACE_test.csv")

test_data.columns=["SMILES", "TARGET"]

test_data.to_csv("./datasets/BACE_test.csv", index=False)

print("------ train data ------")

print(train_data)

print("POSITIVE:", sum(train_data["TARGET"]==1))

print("NEGETIVE:", sum(train_data["TARGET"]==0))

print("------ test data ------")

print(test_data)

print("POSITIVE:", sum(test_data["TARGET"]==1))

print("NEGETIVE:", sum(test_data["TARGET"]==0))

------ train data ------
                                                 SMILES  TARGET
0     CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2ccc(F)c(C#CC...       1
1     CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCF)c...       1
2     CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCF)...       1
3     CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCO)...       1
4     CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCCC...       1
...                                                 ...     ...
1205       CN1C(=O)C(c2ccncc2)(c2cccc(-c3ccoc3)c2)N=C1N       0
1206  CC(=O)N[C@@H](Cc1cc(F)cc(F)c1)[C@H](O)C[NH2+]C...       0
1207  O=C(C1C[NH2+]CC1c1ccccc1Br)N1CCC(c2ccccc2)CC1c...       0
1208  Nc1cccc(Cn2c(-c3ccc(Oc4ccncc4)cc3)ccc2-c2ccccc...       0
1209  CC1(C)Cc2cc(Cl)ccc2C(NC(Cc2cscc2CCC2CC2)C(=O)[...       0

[1210 rows x 2 columns]
POSITIVE: 515
NEGETIVE: 695
------ test  data ------
                                                SMILES  TARGET
0    Cc1ccccc1-c1ccc2nc(N)c(CC(C)C(=O)NCC3CCOCC3)cc2c1       1
1    CC(C)(C)Cc1cnc2c(c1)C([NH2+]CC(O)C1Cc3cccc(c3)...       1
2    COCC(=O)NC(Cc1cccc(-c2ncco2)c1)C(O)C[NH2+]C1CC...       1
3    COCC(=O)NC(Cc1cccc(-c2ccccn2)c1)C(O)C[NH2+]C1C...       1
4    CCn1cc2c3c(cc(C(=O)NC(Cc4ccccc4)[C@H](O)C[NH2+...       1
..                                                 ...     ...
298  CCn1cc2c3c(cc(C(=O)NC(Cc4ccccc4)C(=O)C[NH2+]C4...       1
299      CN1CCC(C)(c2cc(NC(=O)c3ccc(Cl)cn3)ccc2F)N=C1N       1
300  COc1cccc(C[NH2+]C[C@H](O)[C@H](Cc2cc(F)cc(F)c2...       1
301  CC(C)(C)c1cccc(C[NH2+]C2CS(=O)(=O)CC(Cc3cc(F)c...       1
302  COc1cccc(C[NH2+]CC(O)C(Cc2ccccc2)NC(=O)C(CCc2c...       1

[303 rows x 2 columns]
POSITIVE: 176
NEGETIVE: 127

代码

文本

As we can see, in the BACE dataset:

Molecules are represented by SMILES strings.
The task is a binary classification task, where TARGET == 1 indicates that the molecule is active against the BACE-1 target, and TARGET == 0 indicates that the molecule is not active against the BACE-1 target.

This is a common molecular property prediction task. Alright, let's set this dataset aside for now. Next, let's officially begin our exploration.

代码

文本

A Brief History of QSAR

Quantitative Structure-Activity Relationship (QSAR) is a method used to study the quantitative relationship between the chemical structure of compounds and their biological activity. It is one of the most important tools in Computer-Aided Drug Design (CADD). QSAR aims to establish mathematical models that link molecular structures with their biochemical and physicochemical properties, helping drug scientists make rational predictions about the properties of new drug molecules.

QSAR evolved from Structure-Activity Relationship (SAR) analysis. The origins of SAR can be traced back to the late 19th century when chemists began studying the relationship between the structure of compounds and their biological activity. German chemist Paul Ehrlich (1854-1915) proposed the "lock-and-key" hypothesis, suggesting that the interaction between compounds (the key) and biological targets (the lock) depends on their spatial compatibility. As scientists gained a deeper understanding of molecular interactions, they found that, besides spatial matching, the surface properties of the target (such as hydrophobicity and electrophilicity) and the corresponding structural properties of the ligand were also crucial. This led to the development of various methods to evaluate the relationship between structural characteristics and binding affinity, known as structure-activity relationships.

However, SAR methods mainly relied on chemists' experience and intuition, lacking a solid theoretical foundation and unified analytical approach. To overcome these limitations, scientists began using mathematical and statistical methods in the 1960s to perform quantitative analysis of the relationship between molecular structure and biological activity.

The earliest QSAR model dates back to 1868 when chemist Alexander Crum Brown and physiologist Thomas R. Fraser studied the relationship between compound structure and biological activity. While examining the biological effects of methylation of the basic nitrogen atom in alkaloids, they proposed that the physiological activity of a compound depends on its composition, expressed as: $ϕ = f (C)$ , where $ϕ$ represents biological activity and $C$ represents compound composition. This Crum-Brown equation laid the foundation for subsequent QSAR research.

Over time, many QSAR models were introduced, such as Hammett's model linking organic compound toxicity to molecular electronic properties, and Taft's steric parameter model. In 1964, Hansch and Fujita proposed the well-known Hansch model, which posits that a molecule's biological activity is mainly determined by its hydrophobic effect ( $lo g P$ ), steric effect ( $E_{s}$ ), and electronic effect ( $σ$ ), assuming these effects are independently additive. The complete form is expressed as: $lo g (1/ C) = α lo g P + β E_{s} + θ σ + η$ . The Hansch model was the first to quantitatively describe the relationship between chemical information and drug biological activity, providing a practical framework for subsequent QSAR research and marking the transition from blind drug design to rational drug design.

Today, QSAR has matured into a well-established field involving various computational methods and techniques. Recently, with the rapid development of machine learning and artificial intelligence, QSAR methods have been further expanded and applied. For example, deep learning techniques have been employed to build QSAR models, improving their predictive power and accuracy. Additionally, QSAR approaches have found extensive applications in environmental science, materials science, and other fields, demonstrating their strong potential and broad applicability.

代码

文本

Basic Requirements for QSAR Modeling

At an international conference held in Setubal, Portugal, in 2002, scientists proposed several rules regarding the validity of QSAR models, known as the "Setubal Principles." These rules were further refined in November 2004 and officially named the "OECD Principles." For a QSAR model to be used for regulatory purposes, it should meet the following 5 conditions:

A defined endpoint
An unambiguous algorithm
A defined domain of applicability
Appropriate measures of goodness-of-fit, robustness, and predictivity
A mechanistic interpretation, if possible

代码

文本

Basic Workflow of QSAR Modeling

Building an effective QSAR model mainly involves three steps:

Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
Using the trained machine learning model to predict the properties of molecules with unknown properties.

Since molecular structures are not in a computer-readable format, we must first convert them into numerical vectors that can be read by computers. This allows for the selection of appropriate mathematical models based on these representations. We call this process molecular representation. Effective molecular representation and the choice of compatible mathematical models are the core of building quantitative structure-activity relationship models.

代码

文本

Molecular Representation

Molecular representation is a numerical depiction that includes molecular properties. Common molecular representation methods include molecular descriptors, fingerprints, SMILES strings, and molecular potential functions.

Wei, J., Chu, X., Sun, X. Y., Xu, K., Deng, H. X., Chen, J., ... & Lei, M. (2019). Machine learning in materials science. InfoMat, 1(3), 338-358.

In fact, the development of QSAR has evolved along with the increasing information content and changing forms of molecular representations, leading to the classification of QSAR models into 1D-QSAR, 2D-QSAR, and 3D-QSAR:

Different molecular representations have distinct numerical characteristics, requiring different machine learning/deep learning models for modeling. Next, we will demonstrate how to build 1D-QSAR, 2D-QSAR, and 3D-QSAR models through practical examples.

代码

文本

1D-QSAR Molecular Representation

Early quantitative structure-activity relationship models mostly used physicochemical properties of molecules, such as molecular weight, water solubility, and molecular surface area, as the method of representation. These physicochemical properties are known as molecular descriptors. This defines the 1D-QSAR stage.

At this stage, experienced scientists often rely on their domain knowledge to design molecular descriptors, constructing properties that may be related to the characteristic being studied. For example, if the goal is to predict whether a drug can pass through the blood-brain barrier, this property may be related to the drug's water solubility, molecular weight, polar surface area, and other physicochemical attributes. Scientists would include such attributes in the molecular descriptors.

During this period, due to limited access to computers or insufficient computational power, scientists often used simple mathematical models for modeling, such as linear regression and random forests. Since molecular representations constructed from descriptors are typically low-dimensional real-valued vectors, these mathematical models are well-suited for this kind of work.

代码

文本

[13]

from rdkit import Chem

from rdkit.Chem import Descriptors

def calculate_1dqsar_repr(smiles):

# Create a molecule object from the SMILES string

mol = Chem.MolFromSmiles(smiles)

# Calculate the molecular weight of the molecule

mol_weight = Descriptors.MolWt(mol)

# Calculate the LogP value of the molecule

log_p = Descriptors.MolLogP(mol)

# Calculate the number of hydrogen bond donors in the molecule

num_h_donors = Descriptors.NumHDonors(mol)

# Calculate the number of hydrogen bond acceptors in the molecule

num_h_acceptors = Descriptors.NumHAcceptors(mol)

# Calculate the topological polar surface area (TPSA) of the molecule

tpsa = Descriptors.TPSA(mol)

# Calculate the number of rotatable bonds in the molecule

num_rotatable_bonds = Descriptors.NumRotatableBonds(mol)

# Calculate the number of aromatic rings in the molecule

num_aromatic_rings = Descriptors.NumAromaticRings(mol)

# Calculate the number of aliphatic rings in the molecule

num_aliphatic_rings = Descriptors.NumAliphaticRings(mol)

# Calculate the number of saturated rings in the molecule

num_saturated_rings = Descriptors.NumSaturatedRings(mol)

# Calculate the number of heteroatoms in the molecule

num_heteroatoms = Descriptors.NumHeteroatoms(mol)

# Calculate the number of valence electrons in the molecule

num_valence_electrons = Descriptors.NumValenceElectrons(mol)

# Calculate the number of radical electrons in the molecule

num_radical_electrons = Descriptors.NumRadicalElectrons(mol)

# Calculate the quantitative estimate of drug-likeness (QED) value of the molecule

qed = Descriptors.qed(mol)

# Return all calculated properties

return [mol_weight, log_p, num_h_donors, num_h_acceptors, tpsa, num_rotatable_bonds, num_aromatic_rings,

num_aliphatic_rings, num_saturated_rings, num_heteroatoms, num_valence_electrons, num_radical_electrons, qed]

# Apply the function to calculate 1D-QSAR molecular representations for training and testing data

train_data["1dqsar_mr"] = train_data["SMILES"].apply(calculate_1dqsar_repr)

test_data["1dqsar_mr"] = test_data["SMILES"].apply(calculate_1dqsar_repr)

代码

文本

[14]

print(train_data["1dqsar_mr"][0])

[433.40500000000026, 3.558600000000002, 1, 4, 67.92, 6, 2, 1, 0, 9, 162, 0, 0.4304551686487377]

代码

文本

[15]

import numpy as np

from sklearn.linear_model import LogisticRegression, SGDClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import BernoulliNB

from sklearn.neural_network import MLPClassifier

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

import matplotlib.pyplot as plt

# Convert training and testing data to NumPy arrays

train_x = np.array(train_data["1dqsar_mr"].values.tolist())

train_y = np.array(train_data["TARGET"].values.tolist())

test_x = np.array(test_data["1dqsar_mr"].values.tolist())

test_y = np.array(test_data["TARGET"].values.tolist())

# Define the list of classifiers to use

classifiers = [

("Logistic Regression", LogisticRegression(random_state=42)),

("Stochastic Gradient Descent", SGDClassifier(loss='log', random_state=42)),

("K-Nearest Neighbors", KNeighborsClassifier()),

("Bernoulli Naive Bayes", BernoulliNB()),

("Decision Tree", DecisionTreeClassifier(random_state=42)),

("Random Forest", RandomForestClassifier(random_state=42)),

("XGBoost", XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)),

("Multi-layer Perceptron", MLPClassifier(hidden_layer_sizes=(128,64,32), activation='relu', solver='adam', max_iter=10000, random_state=42)),

]

# Initialize the results dictionary

results = {}

# Train, predict, and calculate performance metrics for each classifier

for name, classifier in classifiers:

classifier.fit(train_x, train_y)

pred_y = classifier.predict(test_x)

pred_y_proba = classifier.predict_proba(test_x)[:, 1]

accuracy = accuracy_score(test_y, pred_y)

auc = roc_auc_score(test_y, pred_y_proba)

fpr, tpr, _ = roc_curve(test_y, pred_y_proba)

results[f"1D-QSAR-{name}"] = {"ACC": accuracy, "AUC": auc, "FPR": fpr, "TPR": tpr}

print(f"[1D-QSAR][{name}]\tACC:{accuracy:.4f}\tAUC:{auc:.4f}")

# Sort results by AUC in descending order

sorted_results = sorted(results.items(), key=lambda x: x[1]["AUC"], reverse=True)

# Plot ROC curves

plt.figure(figsize=(10, 6), dpi=200)

plt.title("ROC Curves")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

for name, result in sorted_results:

if name.startswith("1D-QSAR"):

plt.plot(result["FPR"], result["TPR"], label=f"{name} (ACC:{result['ACC']:.4f} AUC:{result['AUC']:.4f})")

plt.legend(loc="lower right")

plt.show()

代码

文本

2D-QSAR Molecular Characterization

However, when facing the challenge of predicting molecular properties with unclear biochemical mechanisms, scientists may find it difficult to design effective molecular descriptors to characterize molecules, leading to the failure of QSAR model construction. Since molecular properties are largely determined by molecular structure, such as the functional groups present on the molecule, there is an interest in incorporating the bonding relationships of molecules into QSAR modeling. Thus, the field has entered the stage of 2D-QSAR.

One of the earlier proposed methods is the molecular fingerprint method, such as Morgan fingerprints, which characterizes molecules by traversing the bonding relationships of each atom and its surrounding atoms. To meet the requirement that molecules of different sizes can be represented by numerical vectors of the same length, molecular fingerprints often use hashing operations to ensure uniform vector length, resulting in high-dimensional 0/1 vectors. In this scenario, scientists typically choose machine learning methods that handle high-dimensional sparse vectors well, such as support vector machines and fully connected neural networks, for model construction.

With the development of AI models, deep learning models capable of handling sequence data (e.g., text) like Recurrent Neural Networks (RNN), image data like Convolutional Neural Networks (CNN), and unstructured graph data like Graph Neural Networks (GNN) have been proposed and applied. QSAR models have also been constructed to fit molecular representations based on the data characteristics these models can handle. For example, SMILES string representations of molecules have been applied in RNN modeling, 2D images of molecules in CNN modeling, and the bonding topological structure of molecules converted into graphs in GNN modeling, leading to the development of a series of QSAR modeling methods.

Overall, in the 2D-QSAR stage, various methods are utilized to analyze the bonding relationships (topological structure) of molecules to model and predict molecular properties.

代码

文本

[ ]

import numpy as np

from rdkit.Chem import AllChem

def calculate_2dqsar_repr(smiles):

# Convert the SMILES string to an RDKit molecule object

mol = Chem.MolFromSmiles(smiles)

# Calculate the Morgan fingerprint (radius 3, length 1024 bits)

fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)

# Return the fingerprint as a NumPy array

return np.array(fp)

train_data["2dqsar_mr"] = train_data["SMILES"].apply(calculate_2dqsar_repr)

test_data["2dqsar_mr"] = test_data["SMILES"].apply(calculate_2dqsar_repr)

代码

文本

[ ]

print(train_data["2dqsar_mr"][0].tolist())

代码

文本

[ ]

import numpy as np

from sklearn.linear_model import LogisticRegression, SGDClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import BernoulliNB

from sklearn.neural_network import MLPClassifier

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

import matplotlib.pyplot as plt

# Convert training and testing data to NumPy arrays

train_x = np.array(train_data["2dqsar_mr"].values.tolist())

train_y = np.array(train_data["TARGET"].values.tolist())

test_x = np.array(test_data["2dqsar_mr"].values.tolist())

test_y = np.array(test_data["TARGET"].values.tolist())

# Define the list of classifiers to use

classifiers = [

("Logistic Regression", LogisticRegression(random_state=42)),

("Stochastic Gradient Descent", SGDClassifier(loss='log', random_state=42)),

("K-Nearest Neighbors", KNeighborsClassifier()),

("Bernoulli Naive Bayes", BernoulliNB()),

("Decision Tree", DecisionTreeClassifier(random_state=42)),

("Random Forest", RandomForestClassifier(random_state=42)),

("XGBoost", XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)),

("Multi-layer Perceptron", MLPClassifier(hidden_layer_sizes=(128,64,32), activation='relu', solver='adam', max_iter=10000, random_state=42)),

]

# Train, predict, and calculate performance metrics for each classifier

for name, classifier in classifiers:

classifier.fit(train_x, train_y)

pred_y = classifier.predict(test_x)

pred_y_proba = classifier.predict_proba(test_x)[:, 1]

accuracy = accuracy_score(test_y, pred_y)

auc = roc_auc_score(test_y, pred_y_proba)

fpr, tpr, _ = roc_curve(test_y, pred_y_proba)

results[f"2D-QSAR-{name}"] = {"ACC": accuracy, "AUC": auc, "FPR": fpr, "TPR": tpr}

print(f"[2D-QSAR][{name}]\tACC:{accuracy:.4f}\tAUC:{auc:.4f}")

# Sort results by AUC in descending order

sorted_results = sorted(results.items(), key=lambda x: x[1]["AUC"], reverse=True)

# Plot ROC curves

plt.figure(figsize=(10, 6), dpi=200)

plt.title("ROC Curves")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

for name, result in sorted_results:

if name.startswith("2D-QSAR"):

plt.plot(result["FPR"], result["TPR"], label=f"{name} (ACC:{result['ACC']:.4f} AUC:{result['AUC']:.4f})")

plt.legend(loc="lower right")

plt.show()

代码

文本

3D-QSAR Molecular Characterization

However, due to the presence of intermolecular and intramolecular interactions, molecules with similar topological structures may adopt different conformations in various environments. The conformation of each molecule in different environments and the corresponding energy levels determine the true nature of the molecule. Therefore, scientists aim to incorporate the three-dimensional structure of molecules into QSAR modeling to enhance the ability to predict molecular properties in specific scenarios. This stage is referred to as the 3D-QSAR stage.

The Comparative Molecular Field Analysis (CoFMA) is a widely used 3D-QSAR model. It calculates the forces (i.e., force fields) at various positions in the space where the molecule exists (usually by selecting positions through a grid method) to characterize the three-dimensional structure of the molecule. Of course, there are other beneficial attempts in the field, including characterization methods through electron density, three-dimensional molecular images, or adding geometric information to molecular graphs.

To handle such high-dimensional spatial information, scientists often choose deep learning methods such as deeper FCNN, 3D-CNN, GNN, etc., for modeling.

代码

文本

[ ]

from rdkit.Chem import rdPartialCharges

def calculate_3dqsar_repr(SMILES, max_atoms=100, three_d=False):

mol = Chem.MolFromSmiles(SMILES) # Create a molecule object from the SMILES representation

mol = Chem.AddHs(mol) # Add hydrogen atoms

if three_d:

AllChem.EmbedMolecule(mol, AllChem.ETKDG()) # Calculate 3D coordinates

else:

AllChem.Compute2DCoords(mol) # Calculate 2D coordinates

natoms = mol.GetNumAtoms() # Get the number of atoms

rdPartialCharges.ComputeGasteigerCharges(mol) # Compute Gasteiger charges for the molecule

charges = np.array([float(atom.GetProp("_GasteigerCharge")) for atom in mol.GetAtoms()]) # Retrieve charge values

coords = mol.GetConformer().GetPositions() # Get atomic coordinates

coulomb_matrix = np.zeros((max_atoms, max_atoms)) # Initialize the Coulomb matrix

n = min(max_atoms, natoms)

for i in range(n): # Iterate over atoms

for j in range(i, n):

if i == j:

coulomb_matrix[i, j] = 0.5 * charges[i] ** 2

if i != j:

delta = np.linalg.norm(coords[i] - coords[j]) # Calculate the distance between atoms

if delta != 0:

coulomb_matrix[i, j] = charges[i] * charges[j] / delta # Calculate elements of the Coulomb matrix

coulomb_matrix[j, i] = coulomb_matrix[i, j]

coulomb_matrix = np.where(np.isinf(coulomb_matrix), 0, coulomb_matrix) # Handle infinite values

coulomb_matrix = np.where(np.isnan(coulomb_matrix), 0, coulomb_matrix) # Handle NaN values

return coulomb_matrix.reshape(max_atoms * max_atoms).tolist() # Convert Coulomb matrix to a list and return

train_data["3dqsar_mr"] = train_data["SMILES"].apply(calculate_3dqsar_repr)

test_data["3dqsar_mr"] = test_data["SMILES"].apply(calculate_3dqsar_repr)

代码

文本

[ ]

print(len(train_data.iloc[0]["3dqsar_mr"]))

代码

文本

[ ]

import numpy as np

from sklearn.linear_model import LogisticRegression, SGDClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import BernoulliNB

from sklearn.neural_network import MLPClassifier

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

import matplotlib.pyplot as plt

# Convert training and testing data to NumPy arrays

train_x = np.array(train_data["3dqsar_mr"].values.tolist())

train_y = np.array(train_data["TARGET"].values.tolist())

test_x = np.array(test_data["3dqsar_mr"].values.tolist())

test_y = np.array(test_data["TARGET"].values.tolist())

# Define the list of classifiers to use

classifiers = [

("Logistic Regression", LogisticRegression(random_state=42)),

("Stochastic Gradient Descent", SGDClassifier(loss='log', random_state=42)),

("K-Nearest Neighbors", KNeighborsClassifier()),

("Bernoulli Naive Bayes", BernoulliNB()),

("Decision Tree", DecisionTreeClassifier(random_state=42)),

("Random Forest", RandomForestClassifier(random_state=42)),

("XGBoost", XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)),

("Multi-layer Perceptron", MLPClassifier(hidden_layer_sizes=(128,64,32), activation='relu', solver='adam', max_iter=10000, random_state=42)),

]

# Train, predict, and calculate performance metrics for each classifier

for name, classifier in classifiers:

classifier.fit(train_x, train_y)

pred_y = classifier.predict(test_x)

pred_y_proba = classifier.predict_proba(test_x)[:, 1]

accuracy = accuracy_score(test_y, pred_y)

auc = roc_auc_score(test_y, pred_y_proba)

fpr, tpr, _ = roc_curve(test_y, pred_y_proba)

results[f"3D-QSAR-{name}"] = {"ACC": accuracy, "AUC": auc, "FPR": fpr, "TPR": tpr}

print(f"[3D-QSAR][{name}]\tACC:{accuracy:.4f}\tAUC:{auc:.4f}")

# Sort results by AUC in descending order

sorted_results = sorted(results.items(), key=lambda x: x[1]["AUC"], reverse=True)

# Plot ROC curves

plt.figure(figsize=(10, 6), dpi=200)

plt.title("ROC Curves")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

for name, result in sorted_results:

if name.startswith("3D-QSAR"):

plt.plot(result["FPR"], result["TPR"], label=f"{name} (ACC:{result['ACC']:.4f} AUC:{result['AUC']:.4f})")

plt.legend(loc="lower right")

plt.show()

代码

文本

Uni-Mol Molecular Representation Learning and Pretraining Framework

Pretraining Model

One of the main challenges in QSAR modeling within the field of drug development is the limited amount of data. Due to the high cost and experimental difficulty of obtaining drug activity data, there is often a lack of labeled data. Insufficient data affects the model's predictive ability, as it may be difficult for the model to capture enough information to describe the relationship between compound structure and biological activity.

Faced with this situation of insufficient labeled data, the pretrain-finetune approach has become a common solution in more mature fields of machine learning, such as natural language processing (NLP) and computer vision (CV). Pretraining involves training the model on a large amount of unlabeled data through self-supervised learning, allowing the model to gain basic information and general capabilities. The model is then fine-tuned on a smaller set of labeled data through supervised learning to equip it with specific problem-solving abilities.

For example, if I want to perform image recognition of cats and dogs but lack sufficient labeled data, I can first pretrain the model using a large set of unlabeled images, enabling it to learn basic concepts of lines, shapes, and contours. Afterward, I can use supervised learning with cat and dog images, allowing the model to quickly learn to distinguish between cats and dogs based on contour information.

The pretraining approach can effectively utilize large amounts of easily accessible unlabeled data to improve the model's generalization ability and predictive performance. In QSAR modeling, we can also leverage the concept of pretraining to address issues related to data quantity and quality.

Introduction to Uni-Mol

Uni-Mol is a universal molecular representation learning framework based on 3D molecular structures, released by DeepModeling in May 2022. Uni-Mol takes 3D molecular structures as model input and uses around 200 million small molecule conformations and 3 million protein surface cavity structures. It pretrains the model using two self-supervised tasks: atom type restoration and atom coordinate restoration.

Uni-Mol Paper: https://openreview.net/forum?id=6K2RM6wVqKu
Open-source Code: https://github.com/dptech-corp/Uni-Mol

The representation learning from 3D information and the effective pretraining approach allow Uni-Mol to outperform SOTA (state of the art) models in almost all downstream tasks related to drug molecules and protein pockets. Uni-Mol can directly handle tasks such as molecular conformation generation and protein-ligand binding pose prediction, surpassing existing solutions. The paper was accepted at the top machine learning conference ICLR 2023.

Next, we will use Uni-Mol to build a BACE-1 molecular activity prediction task:

代码

文本

[ ]

from unimol import MolTrain, MolPredict

clf = MolTrain(task='classification',

data_type='molecule',

epochs=100,

learning_rate=0.00008,

batch_size=64,

early_stopping=10,

metrics='auc',

split='random',

save_path='./exp',

)

clf.fit('datasets/BACE_train.csv')

代码

文本

[ ]

predm = MolPredict(load_model='./exp')

test_pred = predm.predict('datasets/BACE_test.csv')

代码

文本

[ ]

unimol_results = pd.DataFrame({'pred':test_pred.reshape(-1),

'smiles':test_data['SMILES'],

'target':test_data['TARGET']})

plt.figure(figsize=(10, 6), dpi=200)

plt.title("ROC Curves")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

accuracy = accuracy_score(

unimol_results.target,

[1 if p > 0.5 else 0 for p in unimol_results.pred]

)

auc = roc_auc_score(unimol_results.target, unimol_results.pred)

fpr, tpr, _ = roc_curve(unimol_results.target, unimol_results.pred)

plt.plot(fpr, tpr, label=f"Uni-Mol (ACC:{accuracy:.4f} AUC: {auc:.4f})")

print(f"[Uni-Mol]\tACC:{accuracy:.4f}\t AUC:{auc:.4f}")

results["Uni-Mol"] = {"ACC": accuracy, "AUC": auc, "FPR":fpr, "TPR":tpr}

plt.legend(loc="lower right")

plt.show()

代码

文本

Overview of Results

Finally, we can make a horizontal comparison of the predictive performance of 1D-QSAR, 2D-QSAR, and 3D-QSAR combined with different models, as well as Uni-Mol on the same dataset.

代码

文本

[ ]

import pandas as pd

df = pd.DataFrame(results).T.drop(columns=["FPR", "TPR"])

df.sort_values(by="AUC", ascending=False, inplace=True)

代码

文本

[ ]

代码

文本

Uni-Mol

点个赞吧