001 从ChEMBL获取化合物数据
🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先,请点击位于界面顶部的 开始连接 按钮,然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置,稍等片刻即可开始运行。
📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd 。
说明:涉及下载的代码在运行时,需要花费较多时间
Talktorial T001: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising of talktorials T001-T010.
Aim of this talktorial
In this notebook, we will learn more about the ChEMBL database and how to extract data from ChEMBL, i.e. (compound, activity data) pairs for a target of interest. These data sets can be used for many cheminformatics tasks, such as similarity search, clustering or machine learning.
Our work here will include finding compounds that were tested against a certain target and filtering available bioactivity data.
Contents in Theory
- ChEMBL database
- ChEMBL web services
- ChEMBL web resource client
- Compound activity measures
- IC50 measure
- pIC50 value
Contents in Practical
Goal: Get a list of compounds with bioactivity data for a given target
- Connect to ChEMBL database
- Get target data (example: EGFR kinase)
- Fetch and download target data
- Select target ChEMBL ID
- Get bioactivity data
- Fetch and download bioactivity data for targets
- Preprocess and filter bioactivity data
- Get compound data
- Fetch and download compound data
- Preprocess and filter compound data
- Output bioactivity-compound data
- Merge bioactivity and compound data, and add pIC50 values
- Draw molecules with highest pIC50
- Freeze bioactivity data to ChEMBL 27
- Write output file
References
- ChEMBL bioactivity database: Gaulton et al., Nucleic Acids Res. (2017), 45(Database issue), D945–D954
- ChEMBL web services: Davies et al., Nucleic Acids Res. (2015), 43, 612-620
- ChEMBL web-interface
- GitHub ChEMBL web rescource client
- The EBI RDF platform: Jupp et al., Bioinformatics (2014), 30(9), 1338-9
- Info on half maximal inhibitory concentration: (p)IC50
- UniProt website
Theory
ChEMBL database
"ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs." (ChEMBL website)
- Open large-scale bioactivity database
- Current data content (as of 09.2020, ChEMBL 27):
- >1.9 million distinct compounds
- >16 million activity values
- Assays are mapped to ~13,000 targets
- Data sources include scientific literature, PubChem bioassays, Drugs for Neglected Diseases Initiative (DNDi), BindingDB database, ...
- ChEMBL data can be accessed via a web-interface, the EBI-RDF platform and the ChEMBL web rescource client
ChEMBL web services
- RESTful web service
- ChEMBL web service version 2.x resource schema:
Figure 1:
"ChEMBL web service schema diagram. The oval shapes represent ChEMBL web service resources and the line between two resources indicates that they share a common attribute. The arrow direction shows where the primary information about a resource type can be found. A dashed line indicates the relationship between two resources behaves differently. For example, the Image
resource provides a graphical-based representation of a Molecule
."
Figure and description are taken from: Nucleic Acids Res. (2015), 43, 612-620.
ChEMBL web resource client
- Python client library for accessing ChEMBL data
- Handles interaction with the HTTPS protocol
- Lazy evaluation of results -> reduced number of network requests
Compound activity measures
IC50 measure
- Half maximal inhibitory concentration
- Indicates how much of a particular drug or other substance is needed to inhibit a given biological process by half
Figure 2: Visual demonstration of how to derive an IC50 value:
(i) Arrange inhibition data on y-axis and log(concentration) on x-axis. (ii) Identify maximum and minimum inhibition. (iii) The IC50 is the concentration at which the curve passes through the 50% inhibition level. Figure Figure "Example IC50 curve demonstrating visually how IC50 is derived" by JesseAlanGordon is licensed under CC BY-SA 3.0. by JesseAlanGordon is licensed under CC BY-SA 3.0.
pIC50 value
- To facilitate the comparison of IC50 values, which have a large value range and are given in different units (M, nM, ...), often pIC50 values are used
- The pIC50 is the negative log of the IC50 value when converted to molar units: , where is specified in units of M
- Higher pIC50 values indicate exponentially greater potency of the drug
- Note that the conversion can be adapted to the respective IC50 unit, e.g. for nM:
Other activity measures:
Besides, IC50 and pIC50, other bioactivity measures are used, such as the equilibrium constant KI and the half-maximal effective concentration EC50.
Practical
In the following, we want to download all molecules that have been tested against our target of interest, the epidermal growth factor receptor (EGFR) kinase.
Connect to ChEMBL database
First, the ChEMBL web resource client as well as other Python libraries are imported.
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple Collecting chembl_webresource_client Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ca/7d/5addc419def00592eb52d2c94916a322d4f6b307a1d87adb9cc91aea0892/chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.2/55.2 kB 218.7 kB/s eta 0:00:00a 0:00:01 Collecting requests-cache~=0.7.0 Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9a/8b/1a0edf935b70df1be64c5b2988e0322b14a697f89a7818dc12cae3d327f8/requests_cache-0.7.5-py3-none-any.whl (39 kB) Requirement already satisfied: requests>=2.18.4 in /opt/conda/lib/python3.8/site-packages (from chembl_webresource_client) (2.28.2) Requirement already satisfied: urllib3 in /opt/conda/lib/python3.8/site-packages (from chembl_webresource_client) (1.26.14) Requirement already satisfied: easydict in /opt/conda/lib/python3.8/site-packages (from chembl_webresource_client) (1.10) Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.8/site-packages (from requests>=2.18.4->chembl_webresource_client) (3.0.1) Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests>=2.18.4->chembl_webresource_client) (3.4) Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests>=2.18.4->chembl_webresource_client) (2022.12.7) Collecting attrs<22.0,>=21.2 Downloading https://pypi.tuna.tsinghua.edu.cn/packages/be/be/7abce643bfdf8ca01c48afa2ddf8308c2308b0c3b239a44e57d020afa0ef/attrs-21.4.0-py2.py3-none-any.whl (60 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.6/60.6 kB 681.8 kB/s eta 0:00:00a 0:00:01 Requirement already satisfied: pyyaml>=5.4 in /opt/conda/lib/python3.8/site-packages (from requests-cache~=0.7.0->chembl_webresource_client) (6.0) Collecting url-normalize<2.0,>=1.4 Downloading https://pypi.tuna.tsinghua.edu.cn/packages/65/1c/6c6f408be78692fc850006a2b6dea37c2b8592892534e09996e401efc74b/url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB) Collecting itsdangerous>=2.0.1 Downloading https://pypi.tuna.tsinghua.edu.cn/packages/68/5f/447e04e828f47465eeab35b5d408b7ebaaaee207f48b7136c5a7267a30ae/itsdangerous-2.1.2-py3-none-any.whl (15 kB) Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from url-normalize<2.0,>=1.4->requests-cache~=0.7.0->chembl_webresource_client) (1.16.0) Installing collected packages: url-normalize, itsdangerous, attrs, requests-cache, chembl_webresource_client Attempting uninstall: itsdangerous Found existing installation: itsdangerous 1.1.0 Uninstalling itsdangerous-1.1.0: Successfully uninstalled itsdangerous-1.1.0 Attempting uninstall: attrs Found existing installation: attrs 22.1.0 Uninstalling attrs-22.1.0: Successfully uninstalled attrs-22.1.0 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. pytest 3.6.4 requires pluggy<0.8,>=0.5, but you have pluggy 1.0.0 which is incompatible. flask 1.1.4 requires itsdangerous<2.0,>=0.24, but you have itsdangerous 2.1.2 which is incompatible. flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible. Successfully installed attrs-21.4.0 chembl_webresource_client-0.10.8 itsdangerous-2.1.2 requests-cache-0.7.5 url-normalize-1.4.3 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Next, we create resource objects for API access.
chembl_webresource_client.query_set.QuerySet
Get target data (EGFR kinase)
- Get UniProt ID of the target of interest (EGFR kinase: P00533) from UniProt website
- Use UniProt ID to get target information
Select a different UniProt ID, if you are interested in another target.
Fetch target data from ChEMBL
The type of the targets is "<class 'chembl_webresource_client.query_set.QuerySet'>"
Download target data from ChEMBL
The results of the query are stored in targets
, a QuerySet
, i.e. the results are not fetched from ChEMBL until we ask for it (here using pandas.DataFrame.from_records
).
More information about the QuerySet
datatype:
QuerySets are lazy – the act of creating a QuerySet does not involve any database activity. You can stack filters together all day long, and Django will actually not run the query until the QuerySet is evaluated. (querysets-are-lazy)
organism | pref_name | target_chembl_id | target_type | |
---|---|---|---|---|
0 | Homo sapiens | Epidermal growth factor receptor erbB1 | CHEMBL203 | SINGLE PROTEIN |
1 | Homo sapiens | Epidermal growth factor receptor erbB1 | CHEMBL203 | SINGLE PROTEIN |
2 | Homo sapiens | Epidermal growth factor receptor and ErbB2 (HE... | CHEMBL2111431 | PROTEIN FAMILY |
3 | Homo sapiens | Epidermal growth factor receptor | CHEMBL2363049 | PROTEIN FAMILY |
4 | Homo sapiens | MER intracellular domain/EGFR extracellular do... | CHEMBL3137284 | CHIMERIC PROTEIN |
5 | Homo sapiens | Protein cereblon/Epidermal growth factor receptor | CHEMBL4523680 | PROTEIN-PROTEIN INTERACTION |
6 | Homo sapiens | EGFR/PPP1CA | CHEMBL4523747 | PROTEIN-PROTEIN INTERACTION |
7 | Homo sapiens | VHL/EGFR | CHEMBL4523998 | PROTEIN-PROTEIN INTERACTION |
8 | Homo sapiens | Baculoviral IAP repeat-containing protein 2/Ep... | CHEMBL4802031 | PROTEIN-PROTEIN INTERACTION |
Select target (target ChEMBL ID)
After checking the entries, we select the first entry as our target of interest:
CHEMBL203
: It is a single protein and represents the human Epidermal growth factor receptor (EGFR, also named erbB1)
organism Homo sapiens pref_name Epidermal growth factor receptor erbB1 target_chembl_id CHEMBL203 target_type SINGLE PROTEIN Name: 0, dtype: object
Save selected ChEMBL ID.
The target ChEMBL ID is CHEMBL203
Get bioactivity data
Now, we want to query bioactivity data for the target of interest.
Fetch bioactivity data for the target from ChEMBL
In this step, we fetch the bioactivity data and filter it to only consider
- human proteins,
- bioactivity type IC50,
- exact measurements (relation
'='
), and - binding data (assay type
'B'
).
Length and type of bioactivities object: 10420, <class 'chembl_webresource_client.query_set.QuerySet'>
Each entry in our bioactivity set holds the following information:
Length and type of first element: 13, <class 'dict'>
{'activity_id': 32260, 'assay_chembl_id': 'CHEMBL674637', 'assay_description': 'Inhibitory activity towards tyrosine phosphorylation for the epidermal growth factor-receptor kinase', 'assay_type': 'B', 'molecule_chembl_id': 'CHEMBL68920', 'relation': '=', 'standard_units': 'nM', 'standard_value': '41.0', 'target_chembl_id': 'CHEMBL203', 'target_organism': 'Homo sapiens', 'type': 'IC50', 'units': 'uM', 'value': '0.041'}
Download bioactivity data from ChEMBL
Finally, we download the QuerySet
in the form of a pandas
DataFrame
.
Note: This step should not take more than 2 minutes, if so try to rerun all cells starting from "Fetch bioactivity data for the target from ChEMBL" or read this message below:
Load a local version of the data (in case you encounter any problems while fetching the data)
If you experience difficulties to query the ChEMBL database, we also provide the resulting dataframe you will construct in the cell below. If you want to use the saved version, use the following code instead to obtain bioactivities_df
:
# replace first line in cell below with this other line
bioactivities_df = pd.read_csv(DATA / "EGFR_bioactivities_CHEMBL27.csv.zip", index_col=0)
DataFrame shape: (10420, 13)
activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type | units | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32260 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL68920 | = | nM | 41.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 0.041 |
1 | 32267 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL69960 | = | nM | 170.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 0.17 |
2 | 32680 | CHEMBL677833 | In vitro inhibition of Epidermal growth factor... | B | CHEMBL137635 | = | nM | 9300.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 9.3 |
3 | 32770 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL306988 | = | nM | 500000.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 500.0 |
4 | 32772 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL66879 | = | nM | 3000000.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 3000.0 |
Note that the first two rows describe the same bioactivity entry; we will remove such artifacts later during the deduplication step. Note also that we have columns for standard_units
/units
and standard_values
/values
; in the following, we will use the standardized columns (standardization by ChEMBL), and thus, we drop the other two columns.
If we used the units
and values
columns, we would need to convert all values with many different units to nM:
array(['uM', 'nM', 'pM', 'M', "10'3 uM", "10'1 ug/ml", 'ug ml-1', "10'-1microM", "10'1 uM", "10'-1 ug/ml", "10'-2 ug/ml", "10'2 uM", "10'-3 ug/ml", "10'-2microM", '/uM', "10'-6g/ml", 'mM', 'umol/L', 'nmol/L', "10'-10M", "10'-7M", 'nmol', '10^-8M', 'µM'], dtype=object)
activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32260 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL68920 | = | nM | 41.0 | CHEMBL203 | Homo sapiens | IC50 |
1 | 32267 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL69960 | = | nM | 170.0 | CHEMBL203 | Homo sapiens | IC50 |
2 | 32680 | CHEMBL677833 | In vitro inhibition of Epidermal growth factor... | B | CHEMBL137635 | = | nM | 9300.0 | CHEMBL203 | Homo sapiens | IC50 |
3 | 32770 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL306988 | = | nM | 500000.0 | CHEMBL203 | Homo sapiens | IC50 |
4 | 32772 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL66879 | = | nM | 3000000.0 | CHEMBL203 | Homo sapiens | IC50 |
Preprocess and filter bioactivity data
- Convert
standard_value
's datatype fromobject
tofloat
- Delete entries with missing values
- Keep only entries with
standard_unit == nM
- Delete duplicate molecules
- Reset
DataFrame
index - Rename columns
1. Convert datatype of "standard_value" from "object" to "float"
The field standard_value
holds standardized (here IC50) values. In order to make these values usable in calculations later on, convert values to floats.
activity_id int64 assay_chembl_id object assay_description object assay_type object molecule_chembl_id object relation object standard_units object standard_value object target_chembl_id object target_organism object type object dtype: object
activity_id int64 assay_chembl_id object assay_description object assay_type object molecule_chembl_id object relation object standard_units object standard_value float64 target_chembl_id object target_organism object type object dtype: object
2. Delete entries with missing values
Use the parameter inplace=True
to drop values in the current DataFrame
directly.
DataFrame shape: (10419, 11)
3. Keep only entries with "standard_unit == nM"
We only want to keep bioactivity entries in nM
, thus we remove all entries with other units.
Units in downloaded data: ['nM' 'ug.mL-1' '/uM' 'µM'] Number of non-nM entries: 70
Units after filtering: ['nM']
DataFrame shape: (10349, 11)
4. Delete duplicate molecules
Sometimes the same molecule (molecule_chembl_id
) has been tested more than once, in this case, we only keep the first one.
Note other choices could be to keep the one with the best value or a mean value of all assay results for the respective compound.
DataFrame shape: (6823, 11)
5. Reset "DataFrame" index
Since we deleted some rows, but we want to iterate over the index later, we reset the index to be continuous.
activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32260 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL68920 | = | nM | 41.0 | CHEMBL203 | Homo sapiens | IC50 |
1 | 32267 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL69960 | = | nM | 170.0 | CHEMBL203 | Homo sapiens | IC50 |
2 | 32680 | CHEMBL677833 | In vitro inhibition of Epidermal growth factor... | B | CHEMBL137635 | = | nM | 9300.0 | CHEMBL203 | Homo sapiens | IC50 |
3 | 32770 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL306988 | = | nM | 500000.0 | CHEMBL203 | Homo sapiens | IC50 |
4 | 32772 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL66879 | = | nM | 3000000.0 | CHEMBL203 | Homo sapiens | IC50 |
6. Rename columns
activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | units | IC50 | target_chembl_id | target_organism | type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32260 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL68920 | = | nM | 41.0 | CHEMBL203 | Homo sapiens | IC50 |
1 | 32267 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL69960 | = | nM | 170.0 | CHEMBL203 | Homo sapiens | IC50 |
2 | 32680 | CHEMBL677833 | In vitro inhibition of Epidermal growth factor... | B | CHEMBL137635 | = | nM | 9300.0 | CHEMBL203 | Homo sapiens | IC50 |
3 | 32770 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL306988 | = | nM | 500000.0 | CHEMBL203 | Homo sapiens | IC50 |
4 | 32772 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL66879 | = | nM | 3000000.0 | CHEMBL203 | Homo sapiens | IC50 |
DataFrame shape: (6823, 11)
We now have a set of 5575 molecule ids with respective IC50 values for our target kinase.
Get compound data
We have a DataFrame
containing all molecules tested against EGFR (with the respective measured bioactivity).
Now, we want to get the molecular structures of the molecules that are linked to respective bioactivity ChEMBL IDs.
Fetch compound data from ChEMBL
Let's have a look at the compounds from ChEMBL which we have defined bioactivity data for: We fetch compound ChEMBL IDs and structures for the compounds linked to our filtered bioactivity data.
Download compound data from ChEMBL
Again, we want to export the QuerySet
object into a pandas.DataFrame
. Given the data volume, this can take some time. For that reason, we will first obtain the list of records through tqdm
, so we get a nice progress bar and some ETAs. We can then pass the list of compounds to the DataFrame.
Preprocess and filter compound data
- Remove entries with missing entries
- Delete duplicate molecules (by molecule_chembl_id)
- Get molecules with canonical SMILES
1. Remove entries with missing molecule structure entry
2. Delete duplicate molecules
3. Get molecules with canonical SMILES
So far, we have multiple different molecular structure representations. We only want to keep the canonical SMILES.
Sanity check: Remove all molecules without a canonical SMILES string.
Output (bioactivity-compound) data
Summary of compound and bioactivity data
Merge both datasets
Merge values of interest from bioactivities_df
and compounds_df
in an output_df
based on the compounds' ChEMBL IDs (molecule_chembl_id
), keeping the following columns:
- ChEMBL IDs:
molecule_chembl_id
- SMILES:
smiles
- units:
units
- IC50:
IC50
Add pIC50 values
As you can see the low IC50 values are difficult to read (values are distributed over multiple scales), which is why we convert the IC50 values to pIC50.
Draw compound data
Let's have a look at our collected data set.
First, we plot the pIC50 value distribution
In the next steps, we add a column for RDKit molecule objects to our DataFrame
and look at the structures of the molecules with the highest pIC50 values.
Show the three most active molecules, i.e. molecules with the highest pIC50 values.
Freeze output data to ChEMBL 27
This is a technical step: Usually, we would continue to work with the dataset that we just created (latest dataset).
However, here on the TeachOpenCADD platform, we prefer to freeze the dataset to a certain ChEMBL releases (i.e. ChEMBL 27), so that this talktorial and other talktorials downstream in our CADD pipeline do not change in the future (helping us to maintain the talktorials).
Note: If you prefer to run this notebook on the latest dataset or if you want to use it for another target, please comment the cell below.
Write output data to file
We want to use this bioactivity-compound dataset in the following talktorials, thus we save the data as csv
file.
Note that it is advisable to drop the molecule column (which only contains an image of the molecules) when saving the data.
Discussion
In this tutorial, we collected bioactivity data for our target of interest from the ChEMBL database. We filtered the data set in order to only contain molecules with measured IC50 bioactivity values.
Be aware that ChEMBL data originates from various sources. Compound data has been generated in different labs by different people all over the world. Therefore, we have to be cautious with the predictions we make using this data set. It is always important to consider the source of the data and consistency of data production assays when interpreting the results and determining how much confidence we have in our predictions.
In the next tutorials, we will filter our acquired data by Lipinski's rule of five and by unwanted substructures. Another important step would be to clean the molecular data. As this is not shown in any of our talktorials (yet), we would like to refer to the Standardiser library or MolVS as useful tools for this task.