Bohrium
robot
新建

空间站广场

论文
Notebooks
比赛
课程
Apps
我的主页
我的Notebooks
我的论文库
我的足迹

我的工作空间

任务
节点
文件
数据集
镜像
项目
数据库
公开
TeachOpenCADD | 019 激酶的相似性:氨基酸序列
化学信息学
TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15
AI4SCUP-CNS-BBB(v1)

019 激酶的相似性:氨基酸序列

🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先,请点击位于界面顶部的 开始连接 按钮,然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置,稍等片刻即可开始运行。

📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd,由杨合 📨 修改搬运至 Bohrium Notebook。

代码
文本

Aim of this talktorial

In this talktorial, we investigate sequence similarity for kinases of interest. KLIFS API is used to retrieve the residues of the pocket sequence for each kinase.

Two similarity measures are implemented:

  1. Sequence identity, i.e., the similarity which is based on character-wise discrepancy.
  2. Sequence similarity, i.e., the similarity which is based on a substitution matrix, thus, reflecting similarities between amino acids.

Note: We focus on similarities between orthosteric kinase binding sites; similarities to allosteric binding sites are not covered.

代码
文本

Contents in Theory

  • Kinase dataset
  • Kinase similarity descriptor: Sequence
    • Identity score
    • Substitution score
  • From similarity matrix to distance matrix
代码
文本

Contents in Practical

  • Define the kinases of interest
  • Retrieve sequences from KLIFS
  • Sequence similarity
    • Identity score
    • Substitution score
  • Kinase similarity
    • Visualize similarity as kinase matrix
    • Save kinase similarity matrix
  • Kinase distance matrix
    • Save kinase distance matrix
代码
文本

References

代码
文本

Theory

代码
文本

Kinase dataset

代码
文本

We use the kinase selection as defined in Talktorial T023.

代码
文本

Kinase similarity descriptor: Sequence

As mentioned in the previous talktorial, sequence is often used to assess kinase similarity (see the phylogenetic tree developed by Manning et al. Science (2002), 298(5600), 1912-1934).

代码
文本

In this talktorial, the KLIFS pocket sequence is used for two main reasons:

  1. The sequence is of fixed length (it contains residues), which makes computation for pairwise similarity between two sequences straightforward.

  2. The binding pocket is where the action takes place. Why consider the full kinase sequence when an residues sequence contains most relevant information?

代码
文本

sequence_logo

Figure 1: The sequence logo shows amino acid binding motifs and sequence profiles such as amino acid depletion. For example, the sequence logo easily visualizes the conserved G-rich loop (position 4-9) and the DFG motif (position 81-83), see Talktorial T023 for more details.

Note for reproducibility: The figure is generated using the Seq2Logo tool available at http://www.cbs.dtu.dk/biotools/Seq2Logo/. The input is the residues KLIFS binding pocket for the query kinases. All parameters are the the default ones. For the graphical layout, the entry Stacks Per Line is set to and Page size to

代码
文本

We now describe two ways to compare pocket sequences.

代码
文本

Identity score

A simple way of assessing the similarity between two sequences is to use the so-called identity score. First, a match vector is created: it checks whether for each position the characters from the two sequences are identical. If they are, the entry is set to and otherwise.

The identity score is computed by summing the elements in the match vector and normalizing the entry by the length of the sequence, which is in the case of KLIFS pocket sequence.

代码
文本

Let's consider the identity matrix below:

代码
文本
A C D E ...
A 1 0 0 0 ...
C 0 1 0 0 ...
D 0 0 1 0 ...
E 0 0 0 1 ...
... ... ... ... ... ...
代码
文本

and let for two kinases and .

代码
文本

We use the following as similarity between kinases and :

where represents the amino acid at position of the sequence of kinase .

代码
文本

Substitution score

Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.

The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids. In this talktorial, we use the BLOSUM substitution matrix PNAS (1992), 89(22), 10915-10919, implemented in biotite.

代码
文本

The BLOSUM substitution matrix is defined as below (the full matrix will be displayed in the Practical part):

代码
文本
A C D E ...
A 4 0 -2 -1 ...
C 0 9 -3 -4 ...
D -2 -3 6 2 ...
E -1 -4 2 5 ...
... ... ... ... ... ...
代码
文本

The BLOSUM substitution matrix is symmetric, as shown in the practical part.

For convenience, we will translate and rescale the matrix using the following:

for translation, and

for all

for rescaling, such that

and

代码
文本

We use the following as similarity between kinases and :

where represents the amino acid at position of the sequence of kinase and .

代码
文本

From similarity matrix to distance matrix

代码
文本

In order to apply some clustering algorithm to assess the similarity between kinases, it is necessary to start with a distance matrix. A similarity matrix is not, by definition, a distance matrix. For example, the diagonal elements are not zero. For now, we map the similarity matrix to a distance matrix using

See Talktorial T028 for more details.

代码
文本

Practical

代码
文本
[1]
from pathlib import Path
import requests

import pandas as pd
import numpy as np
import seaborn as sns
import biotite.sequence.align as align
代码
文本
[2]
HERE = Path(_dh[-1])
DATA = HERE / "data"
代码
文本
[ ]
#download data
!git clone https://github.com/UR-Free/data.git
代码
文本
[3]
configs = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/pipeline_configs.csv")
configs = configs.set_index("variable")["default_value"]
DEMO = bool(int(configs["DEMO"]))
print(f"Run in demo mode: {DEMO}")
# NBVAL_CHECK_OUTPUT
Run in demo mode: True
代码
文本

Define the kinases of interest

代码
文本

Let's load the kinase selection as defined in Talktorial T023.

代码
文本
[4]
kinase_selection_df = pd.read_csv(
HERE
/ "../T023_what_is_a_kinase/\
data/kinase_selection.csv"
)
kinase_selection_df
# NBVAL_CHECK_OUTPUT
kinase kinase_klifs uniprot_id group full_kinase_name
0 EGFR EGFR P00533 TK Epidermal growth factor receptor
1 ErbB2 ErbB2 P04626 TK Erythroblastic leukemia viral oncogene homolog 2
2 PI3K p110a P42336 Atypical Phosphatidylinositol-3-kinase
3 VEGFR2 KDR P35968 TK Vascular endothelial growth factor receptor 2
4 BRAF BRAF P15056 TKL Rapidly accelerated fibrosarcoma isoform B
5 CDK2 CDK2 P24941 CMGC Cyclic-dependent kinase 2
6 LCK LCK P06239 TK Lymphocyte-specific protein tyrosine kinase
7 MET MET P08581 TK Mesenchymal-epithelial transition factor
8 p38a p38a Q16539 CMGC p38 mitogen activated protein kinase alpha
代码
文本

Retrieve sequences from KLIFS

代码
文本

We use the KLIFS API to retrieve the -long pocket sequence for each kinase.

代码
文本
[5]
def klifs_pocket_sequence(klifs_kinase_name):
"""
Retrieves the pocket sequence from KLIFS using the API.

Parameters
----------
klifs_kinase_name : str
The KLIFS name of the kinase of interest.

Returns
-------
str :
The 85 residues pocket sequence from KLIFS,
if the kinase name is valid, None otherwise.
"""
response = requests.get(
f"https://klifs.net/api/kinase_ID?kinase_name={klifs_kinase_name}&species=HUMAN"
)

if response.status_code == 200:
return response.json()[0]["pocket"]
else:
print(f"KLIFS failed for kinase {klifs_kinase_name}")
return None
代码
文本

We create a dictionary made of the kinase name and its associated pocket sequence. This dictionary is used throughout this notebook.

代码
文本
[6]
kinase_sequences_dict = {}
for kinase in kinase_selection_df["kinase_klifs"]:
kinase_sequences_dict[kinase] = klifs_pocket_sequence(kinase)

for name, sequence in kinase_sequences_dict.items():
print(f"{name:10}{sequence}")
# NBVAL_CHECK_OUTPUT
EGFR      KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
ErbB2     KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA
p110a     CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF
KDR       KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA
BRAF      QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA
CDK2      EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA
LCK       ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA
MET       EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
p38a      SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA
代码
文本

As shown in the cell above, some sequences have missing residues, denoted by "-". Let's plot these sequences as heatmap for a quick visual on conserved regions.

代码
文本
[7]
# Cast dict to DataFrame
kinase_sequences_df = pd.DataFrame(
[list(i) for i in kinase_sequences_dict.values()],
index=kinase_sequences_dict.keys(),
columns=range(1, 86),
)
# Cast letters to integers (gap is set to None)
letter_to_int = {letter: ix for ix, letter in enumerate(list("ACDEFGHIKLMNPQRSTVWY"), 1)}
letter_to_int["-"] = None
kinase_sequences_int_df = kinase_sequences_df.applymap(lambda x: letter_to_int[x])
# Show heatmap (qualitative colormap)
ax = sns.heatmap(kinase_sequences_int_df, cmap="tab20", cbar=False)
代码
文本

Sequence similarity

代码
文本

Given two kinases, we create functions which account for identity or substitution similarity, as described in the Theory part.

代码
文本

Identity score

We first define a function which compares element-wise characters in two sequences.

代码
文本
[8]
def identity_score(sequence1, sequence2):
"""
Computes the element-wise binary similarity between two sequences.

Parameters
----------
sequence1 : np.array
An array of characters describing the first sequence.
sequence2 : np.array
An array of characters describing the second sequence.

Returns
-------
np.array
The bool array for each character.
1 if the elements are identical,
0 otherwise.
"""
# True is the character is the same, False otherwise
return np.compare_chararrays(sequence1, sequence2, cmp="==", rstrip=True)
代码
文本

Substitution score

We now define the function which is more specific to amino acids grouping and use the biotite library to retrieve the BLOSUM substitution matrix.

代码
文本

The substitution matrix can be retrieve from biotite using the following command:

代码
文本
[9]
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)
# NBVAL_CHECK_OUTPUT
    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1
代码
文本

Check for symmetry:

代码
文本
[10]
substitution_matrix.is_symmetric()
# NBVAL_CHECK_OUTPUT
True
代码
文本

Let's perform the translation-rescaling step we discussed in the Theory part.

代码
文本
[11]
def _translate_rescale_substitution_matrix(
substitution_matrix=align.SubstitutionMatrix.std_protein_matrix(),
):
"""
Translate and rescale substitution matrix.

Parameters
----------
substitution_matrix : biotite.sequence.align.SubstitutionMatrix
A substitution matrix specific to amino acids.
The default is align.SubstitutionMatrix.std_protein_matrix()
from biotite, which represents BLOSUM62.

Returns
-------
pd.DataFrame
Translated and rescaled substitution matrix as DataFrame
(index/columns contain letters).
"""
# Retrieve np.array from substitution matrix
score_matrix = substitution_matrix.score_matrix()

# Translation of substitution matrix
translated_matrix = score_matrix - np.min(score_matrix)

# Rescaling
normalized_score_matrix = np.zeros(score_matrix.shape)
for i in range(score_matrix.shape[0]):
for j in range(score_matrix.shape[0]):
normalized_score_matrix[i, j] = translated_matrix[i, j] / np.sqrt(
translated_matrix[i, i] * translated_matrix[j, j]
)

# Create DataFrame from matrix with letters as index/column names
normalized_score_matrix = pd.DataFrame(
normalized_score_matrix,
columns=substitution_matrix.get_alphabet1(),
index=substitution_matrix.get_alphabet1(),
)

# Check for symmetry
symmetric = (
normalized_score_matrix.values == normalized_score_matrix.values.transpose()
).all()
if not symmetric:
raise ValueError(f"Translated/rescaled matrix is not symmetric.")

return normalized_score_matrix
代码
文本

Let's take a look at the translated and rescaled version of the substitution matrix.

代码
文本
[12]
_translate_rescale_substitution_matrix().style.format("{:.2f}")
  A C D E F G H I K L M N P Q R S T V W Y B Z X *
A 1.00 0.39 0.22 0.35 0.22 0.45 0.20 0.38 0.35 0.38 0.35 0.22 0.32 0.35 0.35 0.62 0.47 0.50 0.09 0.21 0.25 0.38 0.82 0.00
C 0.39 1.00 0.09 0.00 0.18 0.09 0.08 0.29 0.09 0.29 0.28 0.09 0.08 0.09 0.09 0.29 0.28 0.29 0.14 0.17 0.10 0.10 0.32 0.00
D 0.22 0.09 1.00 0.63 0.10 0.30 0.27 0.11 0.32 0.00 0.11 0.50 0.29 0.42 0.21 0.45 0.32 0.11 0.00 0.10 0.89 0.56 0.55 0.00
E 0.35 0.00 0.63 1.00 0.11 0.21 0.38 0.12 0.56 0.12 0.22 0.42 0.30 0.67 0.44 0.47 0.33 0.24 0.09 0.20 0.59 0.94 0.58 0.00
F 0.22 0.18 0.10 0.11 1.00 0.10 0.27 0.45 0.11 0.45 0.42 0.10 0.00 0.11 0.11 0.22 0.21 0.34 0.41 0.67 0.11 0.11 0.55 0.00
G 0.45 0.09 0.30 0.21 0.10 1.00 0.18 0.00 0.21 0.00 0.11 0.40 0.19 0.21 0.21 0.45 0.21 0.11 0.16 0.10 0.34 0.22 0.55 0.00
H 0.20 0.08 0.27 0.38 0.27 0.18 1.00 0.10 0.29 0.10 0.19 0.46 0.17 0.38 0.38 0.31 0.19 0.10 0.15 0.52 0.41 0.41 0.50 0.00
I 0.38 0.29 0.11 0.12 0.45 0.00 0.10 1.00 0.12 0.75 0.59 0.11 0.11 0.12 0.12 0.25 0.35 0.88 0.09 0.32 0.12 0.12 0.61 0.00
K 0.35 0.09 0.32 0.56 0.11 0.21 0.29 0.12 1.00 0.24 0.33 0.42 0.30 0.56 0.67 0.47 0.33 0.24 0.09 0.20 0.47 0.59 0.58 0.00
L 0.38 0.29 0.00 0.12 0.45 0.00 0.10 0.75 0.24 1.00 0.71 0.11 0.11 0.24 0.24 0.25 0.35 0.62 0.18 0.32 0.00 0.12 0.61 0.00
M 0.35 0.28 0.11 0.22 0.42 0.11 0.19 0.59 0.33 0.71 1.00 0.21 0.20 0.44 0.33 0.35 0.33 0.59 0.26 0.30 0.12 0.35 0.58 0.00
N 0.22 0.09 0.50 0.42 0.10 0.40 0.46 0.11 0.42 0.11 0.21 1.00 0.19 0.42 0.42 0.56 0.42 0.11 0.00 0.19 0.78 0.45 0.55 0.00
P 0.32 0.08 0.29 0.30 0.00 0.19 0.17 0.11 0.30 0.11 0.20 0.19 1.00 0.30 0.20 0.32 0.30 0.21 0.00 0.09 0.21 0.32 0.35 0.00
Q 0.35 0.09 0.42 0.67 0.11 0.21 0.38 0.12 0.56 0.24 0.44 0.42 0.30 1.00 0.56 0.47 0.33 0.24 0.17 0.30 0.47 0.82 0.58 0.00
R 0.35 0.09 0.21 0.44 0.11 0.21 0.38 0.12 0.67 0.24 0.33 0.42 0.20 0.56 1.00 0.35 0.33 0.12 0.09 0.20 0.35 0.47 0.58 0.00
S 0.62 0.29 0.45 0.47 0.22 0.45 0.31 0.25 0.47 0.25 0.35 0.56 0.32 0.47 0.35 1.00 0.59 0.25 0.09 0.21 0.50 0.50 0.82 0.00
T 0.47 0.28 0.32 0.33 0.21 0.21 0.19 0.35 0.33 0.35 0.33 0.42 0.30 0.33 0.33 0.59 1.00 0.47 0.17 0.20 0.35 0.35 0.77 0.00
V 0.50 0.29 0.11 0.24 0.34 0.11 0.10 0.88 0.24 0.62 0.59 0.11 0.21 0.24 0.12 0.25 0.47 1.00 0.09 0.32 0.12 0.25 0.61 0.00
W 0.09 0.14 0.00 0.09 0.41 0.16 0.15 0.09 0.09 0.18 0.26 0.00 0.00 0.17 0.09 0.09 0.17 0.09 1.00 0.47 0.00 0.09 0.30 0.00
Y 0.21 0.17 0.10 0.20 0.67 0.10 0.52 0.32 0.20 0.32 0.30 0.19 0.09 0.30 0.20 0.21 0.20 0.32 0.47 1.00 0.11 0.21 0.52 0.00
B 0.25 0.10 0.89 0.59 0.11 0.34 0.41 0.12 0.47 0.00 0.12 0.78 0.21 0.47 0.35 0.50 0.35 0.12 0.00 0.11 1.00 0.62 0.61 0.00
Z 0.38 0.10 0.56 0.94 0.11 0.22 0.41 0.12 0.59 0.12 0.35 0.45 0.32 0.82 0.47 0.50 0.35 0.25 0.09 0.21 0.62 1.00 0.61 0.00
X 0.82 0.32 0.55 0.58 0.55 0.55 0.50 0.61 0.58 0.61 0.58 0.55 0.35 0.58 0.58 0.82 0.77 0.61 0.30 0.52 0.61 0.61 1.00 0.00
* 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
代码
文本

Let's define a function that calculates the substitution scores between two sequences (will make use of the previously defined function).

代码
文本
[13]
def substitution_score(
sequence1, sequence2, substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()
):
"""
Retrieve the match score given the substitution matrix.

Parameters
----------
sequence1 : np.array
An array of characters describing the first sequence.
sequence2 : np.array
An array of characters describing the second sequence.
substitution_matrix : biotite.sequence.align.SubstitutionMatrix
A substitution matrix specific to amino acids.
The default is align.SubstitutionMatrix.std_protein_matrix()
from biotite, which represents BLOSUM62.

Returns
-------
np.array
The vector of match score
using the normalized substitution matrix.
"""
substitution_matrix_df = _translate_rescale_substitution_matrix(substitution_matrix)

match_score_array = np.zeros(len(sequence1))
for i, (character_seq1, character_seq2) in enumerate(zip(sequence1, sequence2)):
match_score_array[i] = substitution_matrix_df.loc[character_seq1, character_seq2]
return match_score_array
代码
文本

Kinase similarity

代码
文本

Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution.

代码
文本
[14]
def sequence_similarity(sequence_1, sequence_2, type_="identity"):
"""
Compares two sequences using a given metric.

Parameters
----------
sequence_1, sequence_2 : str
The two sequences of strings for comparison.

type_ : str
The type of metric to compute the similarity.
The default is `identity`.

Returns
-------
float :
The similarity between the pocket sequences of the two kinases.
"""

# Replace possible unavailable residue
# noted in KLIFS with "-"
# by the symbol "*" for biotite
sequence_1 = sequence_1.replace("-", "*")
sequence_2 = sequence_2.replace("-", "*")

if len(sequence_1) != len(sequence_1):
raise ValueError(f"Mismatch in sequence lengths.")
else:
seq_array1 = np.array(list(sequence_1))
seq_array2 = np.array(list(sequence_2))

if type_ == "identity":
is_match_array = identity_score(seq_array1, seq_array2)
similarity_normed = sum(is_match_array) / len(sequence_1)
elif type_ == "substitution":
match_score_array = substitution_score(seq_array1, seq_array2)
similarity_normed = sum(match_score_array) / len(sequence_1)
else:
raise ValueError(f"Type {type_} not defined.")

return similarity_normed
代码
文本

Let's look at the sequence similarity between two kinases (see also Figure 2):

代码
文本
[15]
if DEMO:
example1 = "EGFR"
example2 = "MET"
else:
example1 = kinase_selection_df["kinase_klifs"][0]
example2 = kinase_selection_df["kinase_klifs"][1]

print("The sequences are:\n")
for key in (example1, example2):
print(f"{key:5s}: {kinase_sequences_dict[key]}")
# NBVAL_CHECK_OUTPUT
The sequences are:

EGFR : KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
MET  : EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
代码
文本
[16]
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example2], "identity"
)

print(
f"Pocket sequence similarity between {example1} and {example2} kinases: "
f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.
代码
文本
[17]
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example2], "substitution"
)
print(
f"Pocket sequence similarity between {example1} and {example2} kinases: "
f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR and MET kinases: 0.71 using substitution.
代码
文本

EGFR and MET similarity

Figure 2: Sequences and sequence similarity between the kinases EGFR and MET.

代码
文本

We can also look at self-similarity:

代码
文本
[18]
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="identity"
)
print(
f"Pocket sequence similarity between {example1} itself: "
f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR itself: 1.00 using identity.
代码
文本
[19]
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="substitution"
)
print(
f"Pocket sequence similarity between {example1} itself: "
f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR itself: 1.00 using substitution.
代码
文本

As expected, the similarity between a kinase and itself leads to the highest possible score:

代码
文本

Visualize similarity as kinase matrix

代码
文本
[20]
def pairwise_kinase_similarities(kinase_sequence_dictionary, similarity_measure):
"""
Calculate pairwise similarities between a set of kinases.

Parameters
----------
kinase_sequence_dictionary : dict
A dictionary containing the kinase name as key
and the kinase sequence as value.
similarity_measure : str
Select similarity measure: "identity" or "substitution".

Returns
-------
pd.DataFrame
All-against-all similarities between input kinases.
"""
# Initialize matrix with 0
kinase_similarity_matrix = np.zeros(
(len(kinase_sequence_dictionary), len(kinase_sequence_dictionary))
)
# Iterate over matrix and fill with similarity values
for i, kinase_sequence1 in enumerate(kinase_sequence_dictionary.values()):
for j, kinase_sequence2 in enumerate(kinase_sequence_dictionary.values()):
kinase_similarity_matrix[i, j] = sequence_similarity(
kinase_sequence1, kinase_sequence2, similarity_measure
)

# Cast matrix to DataFrame
kinase_similarity_matrix_df = pd.DataFrame(
data=kinase_similarity_matrix,
index=kinase_sequences_dict.keys(),
columns=kinase_sequences_dict.keys(),
)
kinase_similarity_matrix_df.index.name = None
kinase_similarity_matrix_df.columns.name = None

# Check for symmetry
symmetric = (
kinase_similarity_matrix_df.values == kinase_similarity_matrix_df.values.transpose()
).all()
if not symmetric:
raise ValueError(f"Matrix is not symmetric.")

return kinase_similarity_matrix_df
代码
文本

We visualize the similarity matrices using identity and substitution:

代码
文本
Kinase similarity matrix: Identity
代码
文本
[21]
kinase_similarity_matrix_identity_df = pairwise_kinase_similarities(
kinase_sequences_dict, similarity_measure="identity"
)
kinase_similarity_matrix_identity_df
# NBVAL_CHECK_OUTPUT
EGFR ErbB2 p110a KDR BRAF CDK2 LCK MET p38a
EGFR 1.000000 0.894118 0.117647 0.470588 0.376471 0.317647 0.447059 0.458824 0.388235
ErbB2 0.894118 1.000000 0.117647 0.435294 0.400000 0.329412 0.423529 0.470588 0.400000
p110a 0.117647 0.117647 1.000000 0.152941 0.152941 0.105882 0.141176 0.105882 0.141176
KDR 0.470588 0.435294 0.152941 1.000000 0.400000 0.341176 0.435294 0.470588 0.388235
BRAF 0.376471 0.400000 0.152941 0.400000 1.000000 0.329412 0.388235 0.376471 0.376471
CDK2 0.317647 0.329412 0.105882 0.341176 0.329412 1.000000 0.376471 0.364706 0.470588
LCK 0.447059 0.423529 0.141176 0.435294 0.388235 0.376471 1.000000 0.400000 0.388235
MET 0.458824 0.470588 0.105882 0.470588 0.376471 0.364706 0.400000 1.000000 0.364706
p38a 0.388235 0.400000 0.141176 0.388235 0.376471 0.470588 0.388235 0.364706 1.000000
代码
文本
[22]
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_identity_df.style.background_gradient(cmap=cm).format("{:.3f}")
  EGFR ErbB2 p110a KDR BRAF CDK2 LCK MET p38a
EGFR 1.000 0.894 0.118 0.471 0.376 0.318 0.447 0.459 0.388
ErbB2 0.894 1.000 0.118 0.435 0.400 0.329 0.424 0.471 0.400
p110a 0.118 0.118 1.000 0.153 0.153 0.106 0.141 0.106 0.141
KDR 0.471 0.435 0.153 1.000 0.400 0.341 0.435 0.471 0.388
BRAF 0.376 0.400 0.153 0.400 1.000 0.329 0.388 0.376 0.376
CDK2 0.318 0.329 0.106 0.341 0.329 1.000 0.376 0.365 0.471
LCK 0.447 0.424 0.141 0.435 0.388 0.376 1.000 0.400 0.388
MET 0.459 0.471 0.106 0.471 0.376 0.365 0.400 1.000 0.365
p38a 0.388 0.400 0.141 0.388 0.376 0.471 0.388 0.365 1.000
代码
文本
Kinase similarity matrix: Substitution
代码
文本
[23]
kinase_similarity_matrix_substitution_df = pairwise_kinase_similarities(
kinase_sequences_dict, similarity_measure="substitution"
)
kinase_similarity_matrix_substitution_df
# NBVAL_CHECK_OUTPUT
EGFR ErbB2 p110a KDR BRAF CDK2 LCK MET p38a
EGFR 1.000000 0.940963 0.427062 0.716047 0.655028 0.648447 0.711321 0.711258 0.644644
ErbB2 0.940963 1.000000 0.413638 0.702118 0.654600 0.630308 0.685075 0.697967 0.635173
p110a 0.427062 0.413638 1.000000 0.422705 0.436459 0.423994 0.451336 0.393699 0.431194
KDR 0.716047 0.702118 0.422705 1.000000 0.671268 0.653379 0.687648 0.713806 0.653444
BRAF 0.655028 0.654600 0.436459 0.671268 1.000000 0.646755 0.672933 0.638158 0.637912
CDK2 0.648447 0.630308 0.423994 0.653379 0.646755 1.000000 0.681253 0.656025 0.723093
LCK 0.711321 0.685075 0.451336 0.687648 0.672933 0.681253 1.000000 0.690879 0.662581
MET 0.711258 0.697967 0.393699 0.713806 0.638158 0.656025 0.690879 1.000000 0.629355
p38a 0.644644 0.635173 0.431194 0.653444 0.637912 0.723093 0.662581 0.629355 1.000000
代码
文本
[24]
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_substitution_df.style.background_gradient(cmap=cm).format("{:.3f}")
  EGFR ErbB2 p110a KDR BRAF CDK2 LCK MET p38a
EGFR 1.000 0.941 0.427 0.716 0.655 0.648 0.711 0.711 0.645
ErbB2 0.941 1.000 0.414 0.702 0.655 0.630 0.685 0.698 0.635
p110a 0.427 0.414 1.000 0.423 0.436 0.424 0.451 0.394 0.431
KDR 0.716 0.702 0.423 1.000 0.671 0.653 0.688 0.714 0.653
BRAF 0.655 0.655 0.436 0.671 1.000 0.647 0.673 0.638 0.638
CDK2 0.648 0.630 0.424 0.653 0.647 1.000 0.681 0.656 0.723
LCK 0.711 0.685 0.451 0.688 0.673 0.681 1.000 0.691 0.663
MET 0.711 0.698 0.394 0.714 0.638 0.656 0.691 1.000 0.629
p38a 0.645 0.635 0.431 0.653 0.638 0.723 0.663 0.629 1.000
代码
文本

When we compare the matrices calculated based on the identity and substitution score, the overall pattern is similar, while the values are generally higher using the substitution score.

Note: For all downstream analysis, we will only consider the kinase similarity matrix calculated based on the substitution matrix.

代码
文本
[25]
kinase_similarity_matrix_df = kinase_similarity_matrix_substitution_df
代码
文本

Save kinase similarity matrix

代码
文本
[26]
file_name = f"kinase_similarity_matrix.csv"
kinase_similarity_matrix_df.to_csv(DATA / file_name)
代码
文本

Kinase distance matrix

代码
文本

Since all entries are between and , the similarity matrix is mapped to a distance matrix:

代码
文本
[27]
print(
f"The values of the similarity matrix lie between: "
f"{kinase_similarity_matrix_df.min().min():.2f}"
f" and {kinase_similarity_matrix_df.max().max():.2f}"
)
# NBVAL_CHECK_OUTPUT
The values of the similarity matrix lie between: 0.39 and 1.00
代码
文本
[28]
kinase_distance_matrix_df = 1 - kinase_similarity_matrix_df
代码
文本
[29]
kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")
  EGFR ErbB2 p110a KDR BRAF CDK2 LCK MET p38a
EGFR 0.000 0.059 0.573 0.284 0.345 0.352 0.289 0.289 0.355
ErbB2 0.059 0.000 0.586 0.298 0.345 0.370 0.315 0.302 0.365
p110a 0.573 0.586 0.000 0.577 0.564 0.576 0.549 0.606 0.569
KDR 0.284 0.298 0.577 0.000 0.329 0.347 0.312 0.286 0.347
BRAF 0.345 0.345 0.564 0.329 0.000 0.353 0.327 0.362 0.362
CDK2 0.352 0.370 0.576 0.347 0.353 0.000 0.319 0.344 0.277
LCK 0.289 0.315 0.549 0.312 0.327 0.319 0.000 0.309 0.337
MET 0.289 0.302 0.606 0.286 0.362 0.344 0.309 0.000 0.371
p38a 0.355 0.365 0.569 0.347 0.362 0.277 0.337 0.371 0.000
代码
文本

Save kinase distance matrix

代码
文本
[30]
file_name = f"kinase_distance_matrix.csv"
kinase_distance_matrix_df.to_csv(DATA / file_name)
代码
文本

Discussion

In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is set on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: 1. the identity, which treats all amino acids uniformly, and 2. the substitution, which takes into account the rate of change of residues over evolutionary time.

The kinase similarity matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.

代码
文本

Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Kinase similarity: Sequence

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

代码
文本
化学信息学
TeachOpenCADD
化学信息学TeachOpenCADD
点个赞吧
推荐阅读
公开
TeachOpenCADD | 020 激酶的相似性:配体结构
化学信息学TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15
公开
TeachOpenCADD | 018 激酶是什么?
化学信息学TeachOpenCADD
化学信息学TeachOpenCADD
YangHe
发布于 2023-06-15
1 赞