Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 (Regression Task)

Deep Learning

RDKit

QSAR

Tutorial

Machine Learning

Uni-Mol

notebook

Scikit-Learn

Yani Guan

Updated on 2024-10-24

Recommended image: Uni-Mol:unimol-qsar:v0.5

Recommended machine type: c12_m92_1 * NVIDIA V100

Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 & Uni-Mol Introductory Practice (Regression Task)

©️ Copyright 2023 @ Authors
Author: Hang Zheng 📨
Date: 2023-06-16
Sharing Agreement: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the unimol-qsar:v0.2 image and any GPU node configuration, and wait a moment to run.

In recent years, Artificial Intelligence (AI) has been developing at an unprecedented speed, bringing significant breakthroughs and transformations to various fields.

In drug development, scientists have used mathematical and statistical methods to support the discovery process since the last century. Starting from the structure of drug molecules, they build mathematical models to predict the biochemical activity of drugs. This approach is known as Quantitative Structure-Activity Relationship (QSAR) modeling. QSAR models have continued to evolve as research on drug molecules has deepened and more AI methods have been introduced.

It can be said that QSAR models are a good microcosm of the development of the AI for Science field. In this Notebook, we will introduce the construction methods of different types of QSAR models in the form of case studies.

Introduction
- Data Preparation
A Brief History of QSAR
Basic Requirements for QSAR Modeling
Basic Workflow of QSAR Modeling
Molecular Representation
- 1D-QSAR
- 2D-QSAR
- 3D-QSAR
Uni-Mol Molecular Representation Learning and Pretraining
- Pretraining
- Uni-Mol

Introduction

Quantitative Structure-Activity Relationship (QSAR) is a method that studies the quantitative relationship between the chemical structure of compounds and their biological activity. It is one of the most important tools in Computer-Aided Drug Design (CADD). QSAR aims to establish mathematical models to relate molecular structures with their biochemical and physicochemical properties, helping drug scientists to make rational predictions about the properties of new drug molecules.

Building an effective QSAR model involves several steps:

- Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
- Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
- Using the trained machine learning model to predict the properties of molecules with unknown properties.

The development of QSAR models has evolved with the progression of molecular representation techniques and the corresponding upgrades in machine learning models. In this notebook, we will introduce the construction methods of different types of QSAR models through case studies.

Let's Prepare Some Data!

To better guide everyone through the process of building QSAR models, we will use the prediction of hERG protein inhibitory capability as a demonstration case.

We can start by downloading the hERG dataset:

import os

os.makedirs("datasets", exist_ok=True)

!pip install seaborn
!pip install lightgbm
!wget https://dp-public.oss-cn-beijing.aliyuncs.com/community/hERG.csv -O datasets/hERG.csv

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting seaborn
  Downloading seaborn-0.12.2-py3-none-any.whl (293 kB)
     |████████████████████████████████| 293 kB 338 kB/s eta 0:00:01
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /opt/conda/lib/python3.8/site-packages (from seaborn) (3.7.1)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in /opt/conda/lib/python3.8/site-packages (from seaborn) (1.20.3)
Requirement already satisfied: pandas>=0.25 in /opt/conda/lib/python3.8/site-packages (from seaborn) (1.5.3)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.5.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.7)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.39.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.1)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: importlib-resources>=3.2.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (5.12.0)
Requirement already satisfied: zipp>=3.1.0 in /opt/conda/lib/python3.8/site-packages (from importlib-resources>=3.2.0->matplotlib!=3.6.1,>=3.1->seaborn) (3.15.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.25->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Installing collected packages: seaborn
Successfully installed seaborn-0.12.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting lightgbm
  Downloading lightgbm-3.3.5-py3-none-manylinux1_x86_64.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 319 kB/s eta 0:00:01
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/lib/python3.8/site-packages (from lightgbm) (0.24.2)
Requirement already satisfied: wheel in /opt/conda/lib/python3.8/site-packages (from lightgbm) (0.40.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from lightgbm) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/lib/python3.8/site-packages (from lightgbm) (1.6.3)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.8/site-packages (from scikit-learn!=0.22.0->lightgbm) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn!=0.22.0->lightgbm) (3.1.0)
Installing collected packages: lightgbm
Successfully installed lightgbm-3.3.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
--2023-06-17 12:34:08--  https://dp-public.oss-cn-beijing.aliyuncs.com/community/hERG.csv
Resolving ga.dp.tech (ga.dp.tech)... 10.255.255.41
Connecting to ga.dp.tech (ga.dp.tech)|10.255.255.41|:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 560684 (548K) [text/csv]
Saving to: ‘datasets/hERG.csv’

datasets/hERG.csv   100%[===================>] 547.54K  --.-KB/s    in 0.08s   

2023-06-17 12:34:09 (6.55 MB/s) - ‘datasets/hERG.csv’ saved [560684/560684]

Then, we can take a look at the composition of this dataset:

import pandas as pd
import numpy as np

data = pd.read_csv("./datasets/hERG.csv")
print("------------ Original data ------------")
print(data)
data.columns = ["SMILES", "TARGET"]

# Set 80% of the dataset as the training set and 20% as the test set
train_fraction = 0.8
train_data = data.sample(frac=train_fraction, random_state=1)
train_data.to_csv("./datasets/hERG_train.csv", index=False)
test_data = data.drop(train_data.index)
test_data.to_csv("./datasets/hERG_test.csv", index=False)

# Set training/test targets
train_y = np.array(train_data["TARGET"].values.tolist())
test_y = np.array(test_data["TARGET"].values.tolist())

# Create a results dictionary to store future results
results = {}

# Visualize the pIC50 distribution of the training and test sets
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,4), dpi=150)
font = {'family': 'serif',
        'color': 'black',
        'weight': 'normal',
        'size': 15}
plt.hist(train_data["TARGET"], bins=20, label="Train Data")
plt.hist(test_data["TARGET"], bins=20, label="Test Data")
plt.ylabel("Count", fontdict=font)
plt.xlabel("pIC50", fontdict=font)
plt.legend()
plt.show()

------------ Original data ------------
                                                 SMILES  pIC50
0     Cc1ccc(CN2[C@@H]3CC[C@H]2C[C@@H](C3)Oc4cccc(c4...   9.85
1     COc1nc2ccc(Br)cc2cc1[C@@H](c3ccccc3)[C@@](O)(C...   9.70
2     NC(=O)c1cccc(O[C@@H]2C[C@H]3CC[C@@H](C2)N3CCCc...   9.60
3                          CCCCCCCc1cccc([n+]1C)CCCCCCC   9.60
4     Cc1ccc(CN2[C@@H]3CC[C@H]2C[C@@H](C3)Oc4cccc(c4...   9.59
...                                                 ...    ...
9199  O=C1[C@H]2N(c3ccc(OCC=CCCNCC(=O)Nc4c(Cl)cc(cc4...   4.89
9200  O=C1[C@H]2N(c3ccc(OCCCCCNCC(=O)Nc4c(Cl)cc(cc4C...   4.89
9201  O=C1[C@H]2N(c3ccc(OCC=CCCCNCC(=O)Nc4c(Cl)cc(cc...   4.89
9202  O=C1[C@H]2N(c3ccc(OCCCCCCNCC(=O)Nc4c(Cl)cc(cc4...   4.49
9203  O=C1N=C/C(=C2\N(c3c(cc(Cl)c(Cl)c3)N\2)Cc4cc(Cl...   5.30

[9204 rows x 2 columns]

<Figure size 900x600 with 1 Axes>

You can see that in the hERG dataset:

- Molecules are represented by SMILES strings;
- The task is a regression task: predicting each molecule's inhibitory activity against the hERG protein, expressed as pIC50.
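
For readers new to the label: pIC50 is conventionally defined as the negative base-10 logarithm of the IC50 expressed in mol/L, so a larger value means stronger inhibition. The sketch below shows that conversion, assuming the IC50 is given in nanomolar. The helper name `ic50_nm_to_pic50` is ours for illustration and is not part of the dataset tooling; the notebook itself uses the pIC50 values already stored in the CSV.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


# 1000 nM equals 1 µM, which corresponds to a pIC50 of about 6
print(ic50_nm_to_pic50(1000.0))
```

Because the scale is logarithmic, a difference of one pIC50 unit corresponds to a tenfold difference in IC50.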

This is a common molecular property prediction task. Alright, let's put this dataset aside for now. Next, let's officially start exploring.

A Brief History of QSAR

QSAR evolved from Structure-Activity Relationship (SAR) analysis. The origins of SAR can be traced back to the late 19th century, when chemists began studying the relationship between compound structures and biological activity. German chemist Emil Fischer (1852-1919) proposed the "lock-and-key" hypothesis, suggesting that the interaction between compounds (keys) and biological targets (locks) depends on their spatial matching. As scientists deepened their understanding of molecular interactions, they realized that besides spatial matching, the properties of the target surface (e.g., hydrophobicity, electrophilicity) and the corresponding properties of the ligand structure were also crucial. This led to a series of methods for evaluating structural characteristics and binding affinity, known as Structure-Activity Relationships.

However, the SAR method mainly relied on the experience and intuitive judgment of chemists, lacking a rigorous theoretical foundation and unified analytical approach. To overcome these limitations, scientists began using mathematical and statistical methods in the 1960s to conduct quantitative analysis of the relationship between molecular structure and biological activity.

The earliest QSAR model can be traced back to 1868, when chemist Alexander Crum Brown and physiologist Thomas R. Fraser began studying the relationship between compound structure and biological activity. In their research on the biological effects of methylating the basic nitrogen atoms in alkaloids, they proposed that the physiological activity of a compound depends on its composition, expressing the biological activity $\phi$ as a function of the compound composition $C$: $\phi = f(C)$. This is known as the Crum-Brown equation, and it laid the foundation for later QSAR research.

Subsequently, various QSAR models were proposed in academia, such as the model introduced by Hammett, which links organic compound toxicity to molecular electronic properties, and the steric parameter model proposed by Taft. In 1964, Hansch and Fujita introduced the well-known Hansch model, which suggested that a molecule's biological activity is mainly determined by its hydrophobic effect ($\log P$), steric effect ($E_s$), and electronic effect ($\sigma$), and assumed that these three effects contribute independently and additively. The complete form of the model is: $\log(1/C) = \alpha \log P + \beta E_s + \theta \sigma + \eta$. The Hansch model was the first to quantitatively describe the relationship between chemical information and drug biological activity, providing a practical theoretical framework for subsequent QSAR research. It is considered a crucial milestone in the transition from blind drug design to rational drug design.
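
To make the Hansch equation concrete, the sketch below fits its four coefficients by ordinary least squares. Everything in it is an illustrative assumption: the descriptor rows are invented, the activities are generated from coefficients we chose ourselves rather than measured, and the helper names `solve_linear_system` and `fit_hansch` are ours. It only shows that the model is linear in its parameters, so a standard linear fit can recover them.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest pivot into place for stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_hansch(rows, activities):
    """Fit log(1/C) = alpha*logP + beta*E_s + theta*sigma + eta.

    rows: list of (logP, E_s, sigma); returns [alpha, beta, theta, eta].
    """
    design = [[lp, es, sg, 1.0] for lp, es, sg in rows]
    k = 4
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, activities)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free toy data (NOT measured values)
rows = [(1.0, -0.5, 0.1), (2.0, -1.2, 0.3), (0.5, 0.0, -0.2),
        (3.1, -0.7, 0.5), (1.8, -1.5, 0.0), (2.6, -0.3, -0.4)]
true_coeffs = (0.8, 0.3, -1.1, 2.0)  # alpha, beta, theta, eta chosen for the demo
activities = [true_coeffs[0] * lp + true_coeffs[1] * es
              + true_coeffs[2] * sg + true_coeffs[3] for lp, es, sg in rows]

fitted = fit_hansch(rows, activities)
print([round(v, 6) for v in fitted])
```

With real, noisy measurements the fitted coefficients would only approximate the underlying relationship, and a library routine such as a linear regression estimator would be the more practical choice.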

Today, QSAR has developed into a mature research field involving various computational methods and techniques. In recent years, with the rapid development of machine learning and artificial intelligence technologies, QSAR methods have been further expanded and applied. For example, deep learning techniques have been used to build QSAR models, enhancing their predictive capabilities and accuracy. Furthermore, QSAR methods have found broad applications in fields such as environmental science and materials science, demonstrating strong potential and a wide range of application prospects.

Basic Requirements for QSAR Modeling

At an international conference held in Setubal, Portugal, in 2002, scientists proposed several rules regarding the validity of QSAR models, known as the "Setubal Principles." These rules were further refined in November 2004 and officially named the "OECD Principles." For a QSAR model to be used for regulatory purposes, it should meet the following 5 conditions:

- A defined endpoint
- An unambiguous algorithm
- A defined domain of applicability
- Appropriate measures of goodness-of-fit, robustness, and predictivity
- A mechanistic interpretation, if possible

Basic Workflow of QSAR Modeling

Building an effective QSAR model mainly involves three steps:

- Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
- Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
- Using the trained machine learning model to predict the properties of molecules with unknown properties.

Since molecular structures are not in a computer-readable format, we must first convert them into numerical vectors that can be read by computers. This allows for the selection of appropriate mathematical models based on these representations. We call this process molecular representation. Effective molecular representation and the choice of compatible mathematical models are the core of building quantitative structure-activity relationship models.
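
The three steps can be sketched end to end with a deliberately tiny example. This is a toy under stated assumptions, not the notebook's method: `featurize` is a hypothetical stand-in for a real molecular representation (it just counts characters in the SMILES string), `NearestNeighborRegressor` is a minimal hand-written model, and the activity values are made up for illustration.

```python
import math


def featurize(smiles):
    """Step 1 (toy representation): string length, count of 'N', count of 'O'."""
    return [float(len(smiles)), float(smiles.count("N")), float(smiles.count("O"))]


class NearestNeighborRegressor:
    """Step 2 (toy model): memorize the training set, predict the closest target."""

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
        return self

    def predict(self, xs):
        preds = []
        for x in xs:
            nearest = min(range(len(self.xs)),
                          key=lambda i: math.dist(x, self.xs[i]))
            preds.append(self.ys[nearest])
        return preds


# Made-up molecule/property pairs, for illustration only
train = [("CCO", 4.1), ("CCN", 4.6), ("c1ccccc1O", 5.2), ("CCCCCCN", 5.9)]
train_x = [featurize(smi) for smi, _ in train]
train_y = [target for _, target in train]

model = NearestNeighborRegressor().fit(train_x, train_y)

# Step 3: predict the property of a molecule that was not in the training set
print(model.predict([featurize("CCCN")]))
```

The later sections keep this same fit/predict shape and change only the representation and the model.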

Molecular Representation

A molecular representation is a numerical encoding of a molecule that captures its structure and properties. Common molecular representation methods include molecular descriptors, fingerprints, SMILES strings, and molecular potential functions.

Wei, J., Chu, X., Sun, X. Y., Xu, K., Deng, H. X., Chen, J., ... & Lei, M. (2019). Machine learning in materials science. InfoMat, 1(3), 338-358.

In fact, the development of QSAR has evolved along with the increasing information content and changing forms of molecular representations, leading to the classification of QSAR models into 1D-QSAR, 2D-QSAR, and 3D-QSAR.

Different molecular representations have distinct numerical characteristics, requiring different machine learning/deep learning models for modeling. Next, we will demonstrate how to build 1D-QSAR, 2D-QSAR, and 3D-QSAR models through practical examples.

1D-QSAR Molecular Representation

Early quantitative structure-activity relationship models mostly used physicochemical properties of molecules, such as molecular weight, water solubility, and molecular surface area, as the method of representation. These physicochemical properties are known as molecular descriptors. This defines the 1D-QSAR stage.

At this stage, experienced scientists often rely on their domain knowledge to design molecular descriptors, constructing properties that may be related to the characteristic being studied. For example, if the goal is to predict whether a drug can pass through the blood-brain barrier, this property may be related to the drug's water solubility, molecular weight, polar surface area, and other physicochemical attributes. Scientists would include such attributes in the molecular descriptors.

During this period, due to limited access to computers or insufficient computational power, scientists often used simple mathematical models for modeling, such as linear regression and random forests. Since molecular representations constructed from descriptors are typically low-dimensional real-valued vectors, these mathematical models are well-suited for this kind of work.

from rdkit import Chem
from rdkit.Chem import Descriptors

def calculate_1dqsar_repr(smiles):
    # Create a molecule object from the SMILES string
    mol = Chem.MolFromSmiles(smiles)
    # Calculate the molecular weight
    mol_weight = Descriptors.MolWt(mol)
    # Calculate the LogP value of the molecule
    log_p = Descriptors.MolLogP(mol)
    # Calculate the number of hydrogen bond donors in the molecule
    num_h_donors = Descriptors.NumHDonors(mol)
    # Calculate the number of hydrogen bond acceptors in the molecule
    num_h_acceptors = Descriptors.NumHAcceptors(mol)
    # Calculate the topological polar surface area (TPSA) of the molecule
    tpsa = Descriptors.TPSA(mol)
    # Calculate the number of rotatable bonds in the molecule
    num_rotatable_bonds = Descriptors.NumRotatableBonds(mol)
    # Calculate the number of aromatic rings in the molecule
    num_aromatic_rings = Descriptors.NumAromaticRings(mol)
    # Calculate the number of aliphatic rings in the molecule
    num_aliphatic_rings = Descriptors.NumAliphaticRings(mol)
    # Calculate the number of saturated rings in the molecule
    num_saturated_rings = Descriptors.NumSaturatedRings(mol)
    # Calculate the number of heteroatoms in the molecule
    num_heteroatoms = Descriptors.NumHeteroatoms(mol)
    # Calculate the number of valence electrons in the molecule
    num_valence_electrons = Descriptors.NumValenceElectrons(mol)
    # Calculate the number of radical electrons in the molecule
    num_radical_electrons = Descriptors.NumRadicalElectrons(mol)
    # Calculate the QED (quantitative estimation of drug-likeness) value of the molecule
    qed = Descriptors.qed(mol)
    # Return all calculated properties
    return [mol_weight, log_p, num_h_donors, num_h_acceptors, tpsa, num_rotatable_bonds, num_aromatic_rings,
            num_aliphatic_rings, num_saturated_rings, num_heteroatoms, num_valence_electrons, num_radical_electrons, qed]

# Apply the function to calculate 1D-QSAR molecular representation for training and testing data
train_data["1dqsar_mr"] = train_data["SMILES"].apply(calculate_1dqsar_repr)
test_data["1dqsar_mr"] = test_data["SMILES"].apply(calculate_1dqsar_repr)

print(train_data["1dqsar_mr"][:1].values.tolist())

[[464.87300000000016, 2.5531800000000002, 1, 10, 140.73000000000002, 8, 4, 0, 0, 12, 166, 0, 0.4159359067517256]]

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

# Convert training and testing data to NumPy arrays
train_x = np.array(train_data["1dqsar_mr"].values.tolist())
train_y = np.array(train_data["TARGET"].values.tolist())
test_x = np.array(test_data["1dqsar_mr"].values.tolist())
test_y = np.array(test_data["TARGET"].values.tolist())

# Define the list of regressors to use
regressors = [
    ("Linear Regression", LinearRegression()),  # Linear regression model
    ("Ridge Regression", Ridge(random_state=42)),  # Ridge regression model
    ("Lasso Regression", Lasso(random_state=42)),  # Lasso regression model
    ("ElasticNet Regression", ElasticNet(random_state=42)),  # ElasticNet regression model
    ("Support Vector", SVR()),  # Support vector regression model
    ("K-Nearest Neighbors", KNeighborsRegressor()),  # K-nearest neighbors regression model
    ("Decision Tree", DecisionTreeRegressor(random_state=42)),  # Decision tree regression model
    ("Random Forest", RandomForestRegressor(random_state=42)),  # Random forest regression model
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),  # Gradient boosting regression model
    ("XGBoost", XGBRegressor(random_state=42)),  # XGBoost regression model
    ("LightGBM", LGBMRegressor(random_state=42)),  # LightGBM regression model
    ("Multi-layer Perceptron", MLPRegressor(  # Multi-layer perceptron (neural network) regression model
        hidden_layer_sizes=(128,64,32),
        learning_rate_init=0.0001,
        activation='relu', solver='adam',
        max_iter=10000, random_state=42)),
]

# Train and predict for each regressor, and calculate performance metrics
for name, regressor in regressors:
    # Train the regressor
    regressor.fit(train_x, train_y)
    # Predict training data and testing data
    pred_train_y = regressor.predict(train_x)
    pred_test_y = regressor.predict(test_x)
    # Add predictions to training and testing data
    train_data[f"1D-QSAR-{name}_pred"] = pred_train_y
    test_data[f"1D-QSAR-{name}_pred"] = pred_test_y
    # Calculate performance metrics for testing data:
    # the mean squared error and the per-molecule absolute error
    mse = mean_squared_error(test_y, pred_test_y)
    se = abs(test_y - pred_test_y)
    results[f"1D-QSAR-{name}"] = {"MSE": mse, "error": se}
    print(f"[1D-QSAR][{name}]\tMSE:{mse:.4f}")

[1D-QSAR][Linear Regression]	MSE:0.8857
[1D-QSAR][Ridge Regression]	MSE:0.8857
[1D-QSAR][Lasso Regression]	MSE:0.9286
[1D-QSAR][ElasticNet Regression]	MSE:0.9269
[1D-QSAR][Support Vector]	MSE:0.9398
[1D-QSAR][K-Nearest Neighbors]	MSE:0.9110
[1D-QSAR][Decision Tree]	MSE:1.0579
[1D-QSAR][Random Forest]	MSE:0.6052
[1D-QSAR][Gradient Boosting]	MSE:0.7607
[1D-QSAR][XGBoost]	MSE:0.6057
[1D-QSAR][LightGBM]	MSE:0.6426
[1D-QSAR][Multi-layer Perceptron]	MSE:0.9385

import matplotlib.pyplot as plt
import seaborn as sns

# Collect the absolute test-set errors of every 1D-QSAR model
residuals_data = []
for name, result in results.items():
    if name.startswith("1D-QSAR"):
        model_residuals = pd.DataFrame({"Model": name, "Error": result["error"]})
        residuals_data.append(model_residuals)

residuals_df = pd.concat(residuals_data, ignore_index=True)
residuals_df.sort_values(by="Error", ascending=True, inplace=True)
model_order = residuals_df.groupby("Model")["Error"].median().sort_values(ascending=True).index

# Use seaborn to draw a box plot of the absolute errors, ordered by median error
plt.figure(figsize=(10, 7), dpi=150)
font = {'family': 'serif',
        'color': 'black',
        'weight': 'normal',
        'size': 15}
sns.boxplot(y="Model", x="Error", data=residuals_df, order=model_order)
plt.xlabel("Abs Error", fontdict=font)
plt.ylabel("Models", fontdict=font)
plt.show()

<Figure size 1500x1050 with 1 Axes>

2D-QSAR Molecular Characterization

However, when facing the challenge of predicting molecular properties with unclear biochemical mechanisms, scientists may find it difficult to design effective molecular descriptors to characterize molecules, leading to the failure of QSAR model construction. Since molecular properties are largely determined by molecular structure, such as the functional groups present on the molecule, there is an interest in incorporating the bonding relationships of molecules into QSAR modeling. Thus, the field has entered the stage of 2D-QSAR.

One of the earlier proposed methods is the molecular fingerprint method, such as Morgan fingerprints, which characterizes molecules by traversing the bonding relationships of each atom and its surrounding atoms. To meet the requirement that molecules of different sizes can be represented by numerical vectors of the same length, molecular fingerprints often use hashing operations to ensure uniform vector length, resulting in high-dimensional 0/1 vectors. In this scenario, scientists typically choose machine learning methods that handle high-dimensional sparse vectors well, such as support vector machines and fully connected neural networks, for model construction.
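
The hashing step described above can be illustrated without any chemistry library. The sketch below is not RDKit's Morgan algorithm: the fragment strings are hypothetical placeholders for atom-environment identifiers, and `fold_fragments` is a name we made up. It only shows how hashing maps a variable number of substructures onto a fixed-length 0/1 vector, and why collisions become more likely when the vector is short.

```python
import hashlib


def fold_fragments(fragments, n_bits=16):
    """Hash each fragment identifier to a bit position in a fixed-length vector."""
    bits = [0] * n_bits
    for frag in fragments:
        digest = hashlib.sha256(frag.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1  # different fragments may collide on the same bit
    return bits


# Hypothetical fragment identifiers for a small and a larger molecule
small_molecule = ["C", "C-O", "C-C-O"]
large_molecule = ["C", "C-C", "C-N", "c:c", "c:c:c", "C-C-N", "c-Cl", "C=O"]

fp_small = fold_fragments(small_molecule)
fp_large = fold_fragments(large_molecule)

# Both vectors have the same length regardless of molecule size
print(len(fp_small), len(fp_large))
print(sum(fp_small), sum(fp_large))
```

A longer vector lowers the chance that two different substructures share a bit, at the cost of a sparser, higher-dimensional input for the model.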

With the development of AI, deep learning models have been proposed and applied for different kinds of data: Recurrent Neural Networks (RNN) for sequence data such as text, Convolutional Neural Networks (CNN) for image data, and Graph Neural Networks (GNN) for unstructured graph data. QSAR models have accordingly been built on molecular representations that match the data each model can handle. For example, SMILES strings have been used for RNN modeling, 2D images of molecules for CNN modeling, and graphs derived from the bonding topology of molecules for GNN modeling, leading to a series of QSAR modeling methods.
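
As one example of preparing a representation for a sequence model, a SMILES string is usually split into chemically meaningful tokens and mapped to integer indices before it is fed to an RNN. The regular expression below is a simplified sketch that we assume covers common SMILES syntax (bracket atoms, `Cl`/`Br`, ring closures, bonds, and branches); it is not a complete SMILES grammar, and the helper names are ours.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@@H] or [n+]
    r"|Br|Cl"                # two-letter elements written without brackets
    r"|[A-Za-z]"             # single-letter atoms (aromatic atoms are lowercase)
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[=#\-+()/\\@.:~*$]"   # bonds, branches, and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary as new tokens appear."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


vocab = {}
tokens = tokenize_smiles("CCCCCCCc1cccc([n+]1C)CCCCCCC")  # a molecule from the hERG table
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Treating `[n+]` or `Cl` as single tokens keeps the sequence aligned with atoms rather than raw characters, which is the usual motivation for tokenizing before building the vocabulary.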

Overall, in the 2D-QSAR stage, various methods are utilized to analyze the bonding relationships (topological structure) of molecules to model and predict molecular properties.

import numpy as np
from rdkit.Chem import AllChem

def calculate_2dqsar_repr(smiles):
    # Convert the SMILES string to an RDKit molecule object
    mol = Chem.MolFromSmiles(smiles)
    # Calculate the Morgan fingerprint (radius 3, length 512 bits)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=512)
    # Return the fingerprint as a numpy array
    return np.array(fp)

# Apply the function to calculate 2D-QSAR molecular representation for training and testing data
train_data["2dqsar_mr"] = train_data["SMILES"].apply(calculate_2dqsar_repr)
test_data["2dqsar_mr"] = test_data["SMILES"].apply(calculate_2dqsar_repr)

print(train_data["2dqsar_mr"][:1].values.tolist())

[array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1])]

import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.ensemble import RandomForestRegressor

from sklearn.neighbors import KNeighborsRegressor

from sklearn.tree import DecisionTreeRegressor

from sklearn.neural_network import MLPRegressor

from xgboost import XGBRegressor

from sklearn.linear_model import Ridge, Lasso, ElasticNet

from sklearn.svm import SVR

from sklearn.ensemble import GradientBoostingRegressor

from lightgbm import LGBMRegressor

from sklearn.metrics import mean_squared_error

# Convert training and testing data to NumPy arrays

train_x = np.array(train_data["2dqsar_mr"].values.tolist())

train_y = np.array(train_data["TARGET"].values.tolist())

test_x = np.array(test_data["2dqsar_mr"].values.tolist())