Quantitative Structure-Activity Relationship (QSAR) Model from 0 to 1 & Uni-Mol Introductory Practice. Classification Tasks
©️ Copyright 2023 @ Authors
Author:
Hang Zheng (郑行) 📨
Date: 2023-06-06
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick Start: Click the Start Connection button above, select the unimol-qsar:0703 image and any GPU node configuration, and wait a moment to run.
In recent years, Artificial Intelligence (AI) has been developing at an unprecedented speed, bringing significant breakthroughs and transformations to various fields.
In fact, in the field of drug development, drug scientists have been using a series of mathematical and statistical methods to aid the drug development process since the last century. Based on the structure of drug molecules, they construct mathematical simulations to predict the biochemical activity of drugs. This method is known as Quantitative Structure-Activity Relationship (QSAR). QSAR models have continued to evolve with the deepening research on drug molecules and the introduction of more AI methods.
It can be said that QSAR models are a good microcosm of the development of the AI for Science field. In this Notebook, we will introduce the construction methods of different types of QSAR models in the form of case studies.
Introduction
Quantitative Structure-Activity Relationship (QSAR) is a method that studies the quantitative relationship between the chemical structure of compounds and their biological activity. It is one of the most important tools in Computer-Aided Drug Design (CADD). QSAR aims to establish mathematical models to relate molecular structures with their biochemical and physicochemical properties, helping drug scientists to make rational predictions about the properties of new drug molecules.
Building an effective QSAR model involves several steps:
- Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
- Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
- Using the trained machine learning model to predict the properties of molecules with unknown properties.
The development of QSAR models has evolved with the progression of molecular representation techniques and the corresponding upgrades in machine learning models. In this notebook, we will introduce the construction methods of different types of QSAR models through case studies.
Let's Prepare Some Data!
To better guide everyone through the process of building QSAR models, we will use the BACE-1 target molecular activity prediction task as a demonstration case.
We can start by downloading the BACE-1 dataset:
--2024-10-24 06:23:09-- https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/train.csv Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7 Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected. Proxy request sent, awaiting response... 200 OK Length: 81046 (79K) [text/csv] Saving to: ‘./datasets/BACE_train.csv’ ./datasets/BACE_tra 100%[===================>] 79.15K --.-KB/s in 0.006s 2024-10-24 06:23:09 (12.3 MB/s) - ‘./datasets/BACE_train.csv’ saved [81046/81046] --2024-10-24 06:23:10-- https://bohrium-example.oss-cn-zhangjiakou.aliyuncs.com/notebook/bace_classification/test.csv Resolving ga.dp.tech (ga.dp.tech)... 10.255.254.37, 10.255.254.18, 10.255.254.7 Connecting to ga.dp.tech (ga.dp.tech)|10.255.254.37|:8118... connected. Proxy request sent, awaiting response... 200 OK Length: 22128 (22K) [text/csv] Saving to: ‘./datasets/BACE_test.csv’ ./datasets/BACE_tes 100%[===================>] 21.61K --.-KB/s in 0.002s 2024-10-24 06:23:10 (12.9 MB/s) - ‘./datasets/BACE_test.csv’ saved [22128/22128]
Next, we can take a look at the composition of this dataset:
------ train data ------ SMILES TARGET 0 CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2ccc(F)c(C#CC... 1 1 CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCF)c... 1 2 CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCF)... 1 3 CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCO)... 1 4 CN1C(=O)[C@@](c2ccc(OC(F)F)cc2)(c2cccc(C#CCCCC... 1 ... ... ... 1205 CN1C(=O)C(c2ccncc2)(c2cccc(-c3ccoc3)c2)N=C1N 0 1206 CC(=O)N[C@@H](Cc1cc(F)cc(F)c1)[C@H](O)C[NH2+]C... 0 1207 O=C(C1C[NH2+]CC1c1ccccc1Br)N1CCC(c2ccccc2)CC1c... 0 1208 Nc1cccc(Cn2c(-c3ccc(Oc4ccncc4)cc3)ccc2-c2ccccc... 0 1209 CC1(C)Cc2cc(Cl)ccc2C(NC(Cc2cscc2CCC2CC2)C(=O)[... 0 [1210 rows x 2 columns] POSITIVE: 515 NEGETIVE: 695 ------ test data ------ SMILES TARGET 0 Cc1ccccc1-c1ccc2nc(N)c(CC(C)C(=O)NCC3CCOCC3)cc2c1 1 1 CC(C)(C)Cc1cnc2c(c1)C([NH2+]CC(O)C1Cc3cccc(c3)... 1 2 COCC(=O)NC(Cc1cccc(-c2ncco2)c1)C(O)C[NH2+]C1CC... 1 3 COCC(=O)NC(Cc1cccc(-c2ccccn2)c1)C(O)C[NH2+]C1C... 1 4 CCn1cc2c3c(cc(C(=O)NC(Cc4ccccc4)[C@H](O)C[NH2+... 1 .. ... ... 298 CCn1cc2c3c(cc(C(=O)NC(Cc4ccccc4)C(=O)C[NH2+]C4... 1 299 CN1CCC(C)(c2cc(NC(=O)c3ccc(Cl)cn3)ccc2F)N=C1N 1 300 COc1cccc(C[NH2+]C[C@H](O)[C@H](Cc2cc(F)cc(F)c2... 1 301 CC(C)(C)c1cccc(C[NH2+]C2CS(=O)(=O)CC(Cc3cc(F)c... 1 302 COc1cccc(C[NH2+]CC(O)C(Cc2ccccc2)NC(=O)C(CCc2c... 1 [303 rows x 2 columns] POSITIVE: 176 NEGETIVE: 127
As we can see, in the BACE dataset:
- Molecules are represented by SMILES strings.
- The task is a binary classification task, where TARGET == 1 indicates that the molecule is active against the BACE-1 target, and TARGET == 0 indicates that the molecule is not active against the BACE-1 target.
This is a common molecular property prediction task. Alright, let's set this dataset aside for now. Next, let's officially begin our exploration.
A Brief History of QSAR
Quantitative Structure-Activity Relationship (QSAR) is a method used to study the quantitative relationship between the chemical structure of compounds and their biological activity. It is one of the most important tools in Computer-Aided Drug Design (CADD). QSAR aims to establish mathematical models that link molecular structures with their biochemical and physicochemical properties, helping drug scientists make rational predictions about the properties of new drug molecules.
QSAR evolved from Structure-Activity Relationship (SAR) analysis. The origins of SAR can be traced back to the late 19th century when chemists began studying the relationship between the structure of compounds and their biological activity. German chemist Paul Ehrlich (1854-1915) proposed the "lock-and-key" hypothesis, suggesting that the interaction between compounds (the key) and biological targets (the lock) depends on their spatial compatibility. As scientists gained a deeper understanding of molecular interactions, they found that, besides spatial matching, the surface properties of the target (such as hydrophobicity and electrophilicity) and the corresponding structural properties of the ligand were also crucial. This led to the development of various methods to evaluate the relationship between structural characteristics and binding affinity, known as structure-activity relationships.
However, SAR methods mainly relied on chemists' experience and intuition, lacking a solid theoretical foundation and unified analytical approach. To overcome these limitations, scientists began using mathematical and statistical methods in the 1960s to perform quantitative analysis of the relationship between molecular structure and biological activity.
The earliest QSAR model dates back to 1868 when chemist Alexander Crum Brown and physiologist Thomas R. Fraser studied the relationship between compound structure and biological activity. While examining the biological effects of methylation of the basic nitrogen atom in alkaloids, they proposed that the physiological activity of a compound depends on its composition, expressed as: , where represents biological activity and represents compound composition. This Crum-Brown equation laid the foundation for subsequent QSAR research.
Over time, many QSAR models were introduced, such as Hammett's model linking organic compound toxicity to molecular electronic properties, and Taft's steric parameter model. In 1964, Hansch and Fujita proposed the well-known Hansch model, which posits that a molecule's biological activity is mainly determined by its hydrophobic effect (), steric effect (), and electronic effect (), assuming these effects are independently additive. The complete form is expressed as: . The Hansch model was the first to quantitatively describe the relationship between chemical information and drug biological activity, providing a practical framework for subsequent QSAR research and marking the transition from blind drug design to rational drug design.
Today, QSAR has matured into a well-established field involving various computational methods and techniques. Recently, with the rapid development of machine learning and artificial intelligence, QSAR methods have been further expanded and applied. For example, deep learning techniques have been employed to build QSAR models, improving their predictive power and accuracy. Additionally, QSAR approaches have found extensive applications in environmental science, materials science, and other fields, demonstrating their strong potential and broad applicability.
Basic Requirements for QSAR Modeling
At an international conference held in Setubal, Portugal, in 2002, scientists proposed several rules regarding the validity of QSAR models, known as the "Setubal Principles." These rules were further refined in November 2004 and officially named the "OECD Principles." For a QSAR model to be used for regulatory purposes, it should meet the following 5 conditions:
- A defined endpoint
- An unambiguous algorithm
- A defined domain of applicability
- Appropriate measures of goodness-of-fit, robustness, and predictivity
- A mechanistic interpretation, if possible
Basic Workflow of QSAR Modeling
Building an effective QSAR model mainly involves three steps:
- Constructing a reasonable molecular representation, which converts molecular structures into computer-readable numerical representations;
- Selecting a suitable machine learning model for the molecular representation and using existing molecule-property data to train the model;
- Using the trained machine learning model to predict the properties of molecules with unknown properties.
Since molecular structures are not in a computer-readable format, we must first convert them into numerical vectors that can be read by computers. This allows for the selection of appropriate mathematical models based on these representations. We call this process molecular representation. Effective molecular representation and the choice of compatible mathematical models are the core of building quantitative structure-activity relationship models.
Molecular Representation
Molecular representation is a numerical depiction that includes molecular properties. Common molecular representation methods include molecular descriptors, fingerprints, SMILES strings, and molecular potential functions.
Wei, J., Chu, X., Sun, X. Y., Xu, K., Deng, H. X., Chen, J., ... & Lei, M. (2019). Machine learning in materials science. InfoMat, 1(3), 338-358.
In fact, the development of QSAR has evolved along with the increasing information content and changing forms of molecular representations, leading to the classification of QSAR models into 1D-QSAR, 2D-QSAR, and 3D-QSAR:
Different molecular representations have distinct numerical characteristics, requiring different machine learning/deep learning models for modeling. Next, we will demonstrate how to build 1D-QSAR, 2D-QSAR, and 3D-QSAR models through practical examples.
1D-QSAR Molecular Representation
Early quantitative structure-activity relationship models mostly used physicochemical properties of molecules, such as molecular weight, water solubility, and molecular surface area, as the method of representation. These physicochemical properties are known as molecular descriptors. This defines the 1D-QSAR stage.
At this stage, experienced scientists often rely on their domain knowledge to design molecular descriptors, constructing properties that may be related to the characteristic being studied. For example, if the goal is to predict whether a drug can pass through the blood-brain barrier, this property may be related to the drug's water solubility, molecular weight, polar surface area, and other physicochemical attributes. Scientists would include such attributes in the molecular descriptors.
During this period, due to limited access to computers or insufficient computational power, scientists often used simple mathematical models for modeling, such as linear regression and random forests. Since molecular representations constructed from descriptors are typically low-dimensional real-valued vectors, these mathematical models are well-suited for this kind of work.
[433.40500000000026, 3.558600000000002, 1, 4, 67.92, 6, 2, 1, 0, 9, 162, 0, 0.4304551686487377]
2D-QSAR Molecular Characterization
However, when facing the challenge of predicting molecular properties with unclear biochemical mechanisms, scientists may find it difficult to design effective molecular descriptors to characterize molecules, leading to the failure of QSAR model construction. Since molecular properties are largely determined by molecular structure, such as the functional groups present on the molecule, there is an interest in incorporating the bonding relationships of molecules into QSAR modeling. Thus, the field has entered the stage of 2D-QSAR.
One of the earlier proposed methods is the molecular fingerprint method, such as Morgan fingerprints, which characterizes molecules by traversing the bonding relationships of each atom and its surrounding atoms. To meet the requirement that molecules of different sizes can be represented by numerical vectors of the same length, molecular fingerprints often use hashing operations to ensure uniform vector length, resulting in high-dimensional 0/1 vectors. In this scenario, scientists typically choose machine learning methods that handle high-dimensional sparse vectors well, such as support vector machines and fully connected neural networks, for model construction.
With the development of AI models, deep learning models capable of handling sequence data (e.g., text) like Recurrent Neural Networks (RNN), image data like Convolutional Neural Networks (CNN), and unstructured graph data like Graph Neural Networks (GNN) have been proposed and applied. QSAR models have also been constructed to fit molecular representations based on the data characteristics these models can handle. For example, SMILES string representations of molecules have been applied in RNN modeling, 2D images of molecules in CNN modeling, and the bonding topological structure of molecules converted into graphs in GNN modeling, leading to the development of a series of QSAR modeling methods.
Overall, in the 2D-QSAR stage, various methods are utilized to analyze the bonding relationships (topological structure) of molecules to model and predict molecular properties.
3D-QSAR Molecular Characterization
However, due to the presence of intermolecular and intramolecular interactions, molecules with similar topological structures may adopt different conformations in various environments. The conformation of each molecule in different environments and the corresponding energy levels determine the true nature of the molecule. Therefore, scientists aim to incorporate the three-dimensional structure of molecules into QSAR modeling to enhance the ability to predict molecular properties in specific scenarios. This stage is referred to as the 3D-QSAR stage.
The Comparative Molecular Field Analysis (CoFMA) is a widely used 3D-QSAR model. It calculates the forces (i.e., force fields) at various positions in the space where the molecule exists (usually by selecting positions through a grid method) to characterize the three-dimensional structure of the molecule. Of course, there are other beneficial attempts in the field, including characterization methods through electron density, three-dimensional molecular images, or adding geometric information to molecular graphs.
To handle such high-dimensional spatial information, scientists often choose deep learning methods such as deeper FCNN, 3D-CNN, GNN, etc., for modeling.
Uni-Mol Molecular Representation Learning and Pretraining Framework
Pretraining Model
One of the main challenges in QSAR modeling within the field of drug development is the limited amount of data. Due to the high cost and experimental difficulty of obtaining drug activity data, there is often a lack of labeled data. Insufficient data affects the model's predictive ability, as it may be difficult for the model to capture enough information to describe the relationship between compound structure and biological activity.
Faced with this situation of insufficient labeled data, the pretrain-finetune approach has become a common solution in more mature fields of machine learning, such as natural language processing (NLP) and computer vision (CV). Pretraining involves training the model on a large amount of unlabeled data through self-supervised learning, allowing the model to gain basic information and general capabilities. The model is then fine-tuned on a smaller set of labeled data through supervised learning to equip it with specific problem-solving abilities.
For example, if I want to perform image recognition of cats and dogs but lack sufficient labeled data, I can first pretrain the model using a large set of unlabeled images, enabling it to learn basic concepts of lines, shapes, and contours. Afterward, I can use supervised learning with cat and dog images, allowing the model to quickly learn to distinguish between cats and dogs based on contour information.
The pretraining approach can effectively utilize large amounts of easily accessible unlabeled data to improve the model's generalization ability and predictive performance. In QSAR modeling, we can also leverage the concept of pretraining to address issues related to data quantity and quality.
Introduction to Uni-Mol
Uni-Mol is a universal molecular representation learning framework based on 3D molecular structures, released by DeepModeling in May 2022. Uni-Mol takes 3D molecular structures as model input and uses around 200 million small molecule conformations and 3 million protein surface cavity structures. It pretrains the model using two self-supervised tasks: atom type restoration and atom coordinate restoration.
Uni-Mol Paper: https://openreview.net/forum?id=6K2RM6wVqKu
Open-source Code: https://github.com/dptech-corp/Uni-Mol
The representation learning from 3D information and the effective pretraining approach allow Uni-Mol to outperform SOTA (state of the art) models in almost all downstream tasks related to drug molecules and protein pockets. Uni-Mol can directly handle tasks such as molecular conformation generation and protein-ligand binding pose prediction, surpassing existing solutions. The paper was accepted at the top machine learning conference ICLR 2023.
Next, we will use Uni-Mol to build a BACE-1 molecular activity prediction task:
Overview of Results
Finally, we can make a horizontal comparison of the predictive performance of 1D-QSAR, 2D-QSAR, and 3D-QSAR combined with different models, as well as Uni-Mol on the same dataset.