019 激酶的相似性：氨基酸序列

🏃🏻 快速开始
您可以直接在 Bohrium Notebook 上执行此文档。首先，请点击位于界面顶部的开始连接按钮，然后选择 bohrium-notebook:05-31 镜像并选择合适的的机器配置，稍等片刻即可开始运行。

📖 来源
本 Notebook 来自 https://github.com/volkamerlab/teachopencadd，由杨合 📨 修改搬运至 Bohrium Notebook。

代码

文本

Aim of this talktorial

In this talktorial, we investigate sequence similarity for kinases of interest. KLIFS API is used to retrieve the $85$ residues of the pocket sequence for each kinase.

Two similarity measures are implemented:

Sequence identity, i.e., the similarity which is based on character-wise discrepancy.
Sequence similarity, i.e., the similarity which is based on a substitution matrix, thus, reflecting similarities between amino acids.

Note: We focus on similarities between orthosteric kinase binding sites; similarities to allosteric binding sites are not covered.

代码

文本

Contents in Theory

Kinase dataset
Kinase similarity descriptor: Sequence
- Identity score
- Substitution score
From similarity matrix to distance matrix

代码

文本

Contents in Practical

Define the kinases of interest
Retrieve sequences from KLIFS
Sequence similarity
- Identity score
- Substitution score
Kinase similarity
- Visualize similarity as kinase matrix
- Save kinase similarity matrix
Kinase distance matrix
- Save kinase distance matrix

代码

文本

References

Kinase dataset: Molecules (2021), 26(3), 629
KLIFS
- KLIFS URL: https://klifs.net/
- KLIFS database: Nucleic Acid Res. (2020), 49(D1), D562-D569
Sequence-based kinase clustering: Manning et al. Science (2002), 298(5600), 1912-1934
Substitution matrix: PNAS (1992), 89(22), 10915-10919

Biotite

Documentation: https://www.biotite-python.org/index.html

Blosum matrix

Sequence logo: http://www.cbs.dtu.dk/biotools/Seq2Logo/

代码

文本

Theory

代码

文本

Kinase dataset

代码

文本

We use the kinase selection as defined in Talktorial T023.

代码

文本

Kinase similarity descriptor: Sequence

As mentioned in the previous talktorial, sequence is often used to assess kinase similarity (see the phylogenetic tree developed by Manning et al. Science (2002), 298(5600), 1912-1934).

代码

文本

In this talktorial, the KLIFS pocket sequence is used for two main reasons:

The sequence is of fixed length (it contains $85$ residues), which makes computation for pairwise similarity between two sequences straightforward.
The binding pocket is where the action takes place. Why consider the full kinase sequence when an $85$ residues sequence contains most relevant information?

代码

文本

sequence_logo

Figure 1: The sequence logo shows amino acid binding motifs and sequence profiles such as amino acid depletion. For example, the sequence logo easily visualizes the conserved G-rich loop (position 4-9) and the DFG motif (position 81-83), see Talktorial T023 for more details.

Note for reproducibility: The figure is generated using the Seq2Logo tool available at http://www.cbs.dtu.dk/biotools/Seq2Logo/. The input is the $85$ residues KLIFS binding pocket for the query kinases. All parameters are the the default ones. For the graphical layout, the entry Stacks Per Line is set to $85$ and Page size to $1200 \times 480.$

代码

文本

We now describe two ways to compare pocket sequences.

代码

文本

Identity score

A simple way of assessing the similarity between two sequences is to use the so-called identity score. First, a match vector is created: it checks whether for each position the characters from the two sequences are identical. If they are, the entry is set to $1,$ and $0$ otherwise.

The identity score is computed by summing the elements in the match vector and normalizing the entry by the length of the sequence, which is $85$ in the case of KLIFS pocket sequence.

代码

文本

Let's consider the identity matrix $I$ below:

代码

文本

	A	C	D	E	...
A	1	0	0	0	...
C	0	1	0	0	...
D	0	0	1	0	...
E	0	0	0	1	...
...	...	...	...	...	...

代码

文本

and let $M = l e n (K_{i}) = l e n (K_{j})$ for two kinases $K_{i}$ and $K_{j}$ .

代码

文本

We use the following as similarity between kinases $K_{i}$ and $K_{j}$ :

$similarity (K_{i}, K_{j}) = \frac{1}{M} n = 1 \sum M I (K_{i} [n], K_{j} [n]),$

where $K_{i} [n]$ represents the amino acid at position $n$ of the sequence of kinase $i$ .

代码

文本

Substitution score

Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.

The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids. In this talktorial, we use the BLOSUM substitution matrix PNAS (1992), 89(22), 10915-10919, implemented in biotite.

代码

文本

The BLOSUM substitution matrix $SM$ is defined as below (the full matrix will be displayed in the Practical part):

代码

文本

	A	C	D	E	...
A	4	0	-2	-1	...
C	0	9	-3	-4	...
D	-2	-3	6	2	...
E	-1	-4	2	5	...
...	...	...	...	...	...

代码

文本

The BLOSUM substitution matrix is symmetric, as shown in the practical part.

For convenience, we will translate and rescale the matrix using the following:

$S M^{'} = SM - min (SM),$

for translation, and

for all $x, y$

$S M^{''} (x, y) = \frac{S M ^{'} [ x , y ]}{S M ^{'} [ x , x ] * S M ^{'} [ y , y ]},$

for rescaling, such that

$S M^{''} [x, x] = 1 for all x,$ and

$S M^{''} [x, y] \in [0, 1] for all x, y .$

代码

文本

We use the following as similarity between kinases $K_{i}$ and $K_{j}$ :

$similarity (K_{i}, K_{j}) = \frac{1}{M} n = 1 \sum M S M^{''} (K_{i} [n], K_{j} [n]),$

where $K_{i} [n]$ represents the amino acid at position $n$ of the sequence of kinase $i$ and $M = l e n (K_{i}) = l e n (K_{j})$ .

代码

文本

From similarity matrix to distance matrix

代码

文本

In order to apply some clustering algorithm to assess the similarity between kinases, it is necessary to start with a distance matrix. A similarity matrix is not, by definition, a distance matrix. For example, the diagonal elements are not zero. For now, we map the similarity matrix $SM$ to a distance matrix $D M$ using

$D M = 1 - SM .$

See Talktorial T028 for more details.

代码

文本

Practical

代码

文本

[1]

from pathlib import Path

import requests

import pandas as pd

import numpy as np

import seaborn as sns

import biotite.sequence.align as align

代码

文本

[2]

HERE = Path(_dh[-1])

DATA = HERE / "data"

代码

文本

[ ]

#download data

!git clone https://github.com/UR-Free/data.git

代码

文本

[3]

configs = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/pipeline_configs.csv")

configs = configs.set_index("variable")["default_value"]

DEMO = bool(int(configs["DEMO"]))

print(f"Run in demo mode: {DEMO}")

# NBVAL_CHECK_OUTPUT

Run in demo mode: True

代码

文本

Define the kinases of interest

代码

文本

Let's load the kinase selection as defined in Talktorial T023.

代码

文本

[4]

kinase_selection_df = pd.read_csv(

HERE

/ "../T023_what_is_a_kinase/\

data/kinase_selection.csv"

)

kinase_selection_df

# NBVAL_CHECK_OUTPUT

	kinase	kinase_klifs	uniprot_id	group	full_kinase_name
0	EGFR	EGFR	P00533	TK	Epidermal growth factor receptor
1	ErbB2	ErbB2	P04626	TK	Erythroblastic leukemia viral oncogene homolog 2
2	PI3K	p110a	P42336	Atypical	Phosphatidylinositol-3-kinase
3	VEGFR2	KDR	P35968	TK	Vascular endothelial growth factor receptor 2
4	BRAF	BRAF	P15056	TKL	Rapidly accelerated fibrosarcoma isoform B
5	CDK2	CDK2	P24941	CMGC	Cyclic-dependent kinase 2
6	LCK	LCK	P06239	TK	Lymphocyte-specific protein tyrosine kinase
7	MET	MET	P08581	TK	Mesenchymal-epithelial transition factor
8	p38a	p38a	Q16539	CMGC	p38 mitogen activated protein kinase alpha

代码

文本

Retrieve sequences from KLIFS

代码

文本

We use the KLIFS API to retrieve the $85$ -long pocket sequence for each kinase.

代码

文本

[5]

def klifs_pocket_sequence(klifs_kinase_name):

"""

Retrieves the pocket sequence from KLIFS using the API.

Parameters

----------

klifs_kinase_name : str

The KLIFS name of the kinase of interest.

Returns

-------

str :

The 85 residues pocket sequence from KLIFS,

if the kinase name is valid, None otherwise.

"""

response = requests.get(

f"https://klifs.net/api/kinase_ID?kinase_name={klifs_kinase_name}&species=HUMAN"

)

if response.status_code == 200:

return response.json()[0]["pocket"]

else:

print(f"KLIFS failed for kinase {klifs_kinase_name}")

return None

代码

文本

We create a dictionary made of the kinase name and its associated pocket sequence. This dictionary is used throughout this notebook.

代码

文本

[6]

kinase_sequences_dict = {}

for kinase in kinase_selection_df["kinase_klifs"]:

kinase_sequences_dict[kinase] = klifs_pocket_sequence(kinase)

for name, sequence in kinase_sequences_dict.items():

print(f"{name:10}{sequence}")

# NBVAL_CHECK_OUTPUT

EGFR      KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
ErbB2     KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA
p110a     CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF
KDR       KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA
BRAF      QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA
CDK2      EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA
LCK       ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA
MET       EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
p38a      SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA

代码

文本

As shown in the cell above, some sequences have missing residues, denoted by "-". Let's plot these sequences as heatmap for a quick visual on conserved regions.

代码

文本

[7]

# Cast dict to DataFrame

kinase_sequences_df = pd.DataFrame(

[list(i) for i in kinase_sequences_dict.values()],

index=kinase_sequences_dict.keys(),

columns=range(1, 86),

)

# Cast letters to integers (gap is set to None)

letter_to_int = {letter: ix for ix, letter in enumerate(list("ACDEFGHIKLMNPQRSTVWY"), 1)}

letter_to_int["-"] = None

kinase_sequences_int_df = kinase_sequences_df.applymap(lambda x: letter_to_int[x])

# Show heatmap (qualitative colormap)

ax = sns.heatmap(kinase_sequences_int_df, cmap="tab20", cbar=False)

代码

文本

Sequence similarity

代码

文本

Given two kinases, we create functions which account for identity or substitution similarity, as described in the Theory part.

代码

文本

Identity score

We first define a function which compares element-wise characters in two sequences.

代码

文本

[8]

def identity_score(sequence1, sequence2):

"""

Computes the element-wise binary similarity between two sequences.

Parameters

----------

sequence1 : np.array

An array of characters describing the first sequence.

sequence2 : np.array

An array of characters describing the second sequence.

Returns

-------

np.array

The bool array for each character.

1 if the elements are identical,

0 otherwise.

"""

# True is the character is the same, False otherwise

return np.compare_chararrays(sequence1, sequence2, cmp="==", rstrip=True)

代码

文本

Substitution score

We now define the function which is more specific to amino acids grouping and use the biotite library to retrieve the BLOSUM substitution matrix.

代码

文本

The substitution matrix can be retrieve from biotite using the following command:

代码

文本

[9]

substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()

print(substitution_matrix)

# NBVAL_CHECK_OUTPUT

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

代码

文本

Check for symmetry:

代码

文本

[10]

substitution_matrix.is_symmetric()

# NBVAL_CHECK_OUTPUT

True

代码

文本

Let's perform the translation-rescaling step we discussed in the Theory part.

代码

文本

[11]

def _translate_rescale_substitution_matrix(

substitution_matrix=align.SubstitutionMatrix.std_protein_matrix(),

):

"""

Translate and rescale substitution matrix.

Parameters

----------

substitution_matrix : biotite.sequence.align.SubstitutionMatrix

A substitution matrix specific to amino acids.

The default is align.SubstitutionMatrix.std_protein_matrix()

from biotite, which represents BLOSUM62.

Returns

-------

pd.DataFrame

Translated and rescaled substitution matrix as DataFrame

(index/columns contain letters).

"""

# Retrieve np.array from substitution matrix

score_matrix = substitution_matrix.score_matrix()

# Translation of substitution matrix

translated_matrix = score_matrix - np.min(score_matrix)

# Rescaling

normalized_score_matrix = np.zeros(score_matrix.shape)

for i in range(score_matrix.shape[0]):

for j in range(score_matrix.shape[0]):

normalized_score_matrix[i, j] = translated_matrix[i, j] / np.sqrt(

translated_matrix[i, i] * translated_matrix[j, j]

)

# Create DataFrame from matrix with letters as index/column names

normalized_score_matrix = pd.DataFrame(

normalized_score_matrix,

columns=substitution_matrix.get_alphabet1(),

index=substitution_matrix.get_alphabet1(),

)

# Check for symmetry

symmetric = (

normalized_score_matrix.values == normalized_score_matrix.values.transpose()

).all()

if not symmetric:

raise ValueError(f"Translated/rescaled matrix is not symmetric.")

return normalized_score_matrix

代码

文本

Let's take a look at the translated and rescaled version of the substitution matrix.

代码

文本

[12]

_translate_rescale_substitution_matrix().style.format("{:.2f}")

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y	B	Z	X	*
A	1.00	0.39	0.22	0.35	0.22	0.45	0.20	0.38	0.35	0.38	0.35	0.22	0.32	0.35	0.35	0.62	0.47	0.50	0.09	0.21	0.25	0.38	0.82	0.00
C	0.39	1.00	0.09	0.00	0.18	0.09	0.08	0.29	0.09	0.29	0.28	0.09	0.08	0.09	0.09	0.29	0.28	0.29	0.14	0.17	0.10	0.10	0.32	0.00
D	0.22	0.09	1.00	0.63	0.10	0.30	0.27	0.11	0.32	0.00	0.11	0.50	0.29	0.42	0.21	0.45	0.32	0.11	0.00	0.10	0.89	0.56	0.55	0.00
E	0.35	0.00	0.63	1.00	0.11	0.21	0.38	0.12	0.56	0.12	0.22	0.42	0.30	0.67	0.44	0.47	0.33	0.24	0.09	0.20	0.59	0.94	0.58	0.00
F	0.22	0.18	0.10	0.11	1.00	0.10	0.27	0.45	0.11	0.45	0.42	0.10	0.00	0.11	0.11	0.22	0.21	0.34	0.41	0.67	0.11	0.11	0.55	0.00
G	0.45	0.09	0.30	0.21	0.10	1.00	0.18	0.00	0.21	0.00	0.11	0.40	0.19	0.21	0.21	0.45	0.21	0.11	0.16	0.10	0.34	0.22	0.55	0.00
H	0.20	0.08	0.27	0.38	0.27	0.18	1.00	0.10	0.29	0.10	0.19	0.46	0.17	0.38	0.38	0.31	0.19	0.10	0.15	0.52	0.41	0.41	0.50	0.00
I	0.38	0.29	0.11	0.12	0.45	0.00	0.10	1.00	0.12	0.75	0.59	0.11	0.11	0.12	0.12	0.25	0.35	0.88	0.09	0.32	0.12	0.12	0.61	0.00
K	0.35	0.09	0.32	0.56	0.11	0.21	0.29	0.12	1.00	0.24	0.33	0.42	0.30	0.56	0.67	0.47	0.33	0.24	0.09	0.20	0.47	0.59	0.58	0.00
L	0.38	0.29	0.00	0.12	0.45	0.00	0.10	0.75	0.24	1.00	0.71	0.11	0.11	0.24	0.24	0.25	0.35	0.62	0.18	0.32	0.00	0.12	0.61	0.00
M	0.35	0.28	0.11	0.22	0.42	0.11	0.19	0.59	0.33	0.71	1.00	0.21	0.20	0.44	0.33	0.35	0.33	0.59	0.26	0.30	0.12	0.35	0.58	0.00
N	0.22	0.09	0.50	0.42	0.10	0.40	0.46	0.11	0.42	0.11	0.21	1.00	0.19	0.42	0.42	0.56	0.42	0.11	0.00	0.19	0.78	0.45	0.55	0.00
P	0.32	0.08	0.29	0.30	0.00	0.19	0.17	0.11	0.30	0.11	0.20	0.19	1.00	0.30	0.20	0.32	0.30	0.21	0.00	0.09	0.21	0.32	0.35	0.00
Q	0.35	0.09	0.42	0.67	0.11	0.21	0.38	0.12	0.56	0.24	0.44	0.42	0.30	1.00	0.56	0.47	0.33	0.24	0.17	0.30	0.47	0.82	0.58	0.00
R	0.35	0.09	0.21	0.44	0.11	0.21	0.38	0.12	0.67	0.24	0.33	0.42	0.20	0.56	1.00	0.35	0.33	0.12	0.09	0.20	0.35	0.47	0.58	0.00
S	0.62	0.29	0.45	0.47	0.22	0.45	0.31	0.25	0.47	0.25	0.35	0.56	0.32	0.47	0.35	1.00	0.59	0.25	0.09	0.21	0.50	0.50	0.82	0.00
T	0.47	0.28	0.32	0.33	0.21	0.21	0.19	0.35	0.33	0.35	0.33	0.42	0.30	0.33	0.33	0.59	1.00	0.47	0.17	0.20	0.35	0.35	0.77	0.00
V	0.50	0.29	0.11	0.24	0.34	0.11	0.10	0.88	0.24	0.62	0.59	0.11	0.21	0.24	0.12	0.25	0.47	1.00	0.09	0.32	0.12	0.25	0.61	0.00
W	0.09	0.14	0.00	0.09	0.41	0.16	0.15	0.09	0.09	0.18	0.26	0.00	0.00	0.17	0.09	0.09	0.17	0.09	1.00	0.47	0.00	0.09	0.30	0.00
Y	0.21	0.17	0.10	0.20	0.67	0.10	0.52	0.32	0.20	0.32	0.30	0.19	0.09	0.30	0.20	0.21	0.20	0.32	0.47	1.00	0.11	0.21	0.52	0.00
B	0.25	0.10	0.89	0.59	0.11	0.34	0.41	0.12	0.47	0.00	0.12	0.78	0.21	0.47	0.35	0.50	0.35	0.12	0.00	0.11	1.00	0.62	0.61	0.00
Z	0.38	0.10	0.56	0.94	0.11	0.22	0.41	0.12	0.59	0.12	0.35	0.45	0.32	0.82	0.47	0.50	0.35	0.25	0.09	0.21	0.62	1.00	0.61	0.00
X	0.82	0.32	0.55	0.58	0.55	0.55	0.50	0.61	0.58	0.61	0.58	0.55	0.35	0.58	0.58	0.82	0.77	0.61	0.30	0.52	0.61	0.61	1.00	0.00
*	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00

代码

文本

Let's define a function that calculates the substitution scores between two sequences (will make use of the previously defined function).

代码

文本

[13]

def substitution_score(

sequence1, sequence2, substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()

):

"""

Retrieve the match score given the substitution matrix.

Parameters

----------

sequence1 : np.array

An array of characters describing the first sequence.

sequence2 : np.array

An array of characters describing the second sequence.

substitution_matrix : biotite.sequence.align.SubstitutionMatrix

A substitution matrix specific to amino acids.

The default is align.SubstitutionMatrix.std_protein_matrix()

from biotite, which represents BLOSUM62.

Returns

-------

np.array

The vector of match score

using the normalized substitution matrix.

"""

substitution_matrix_df = _translate_rescale_substitution_matrix(substitution_matrix)

match_score_array = np.zeros(len(sequence1))

for i, (character_seq1, character_seq2) in enumerate(zip(sequence1, sequence2)):

match_score_array[i] = substitution_matrix_df.loc[character_seq1, character_seq2]

return match_score_array

代码

文本

Kinase similarity

代码

文本

Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution.

代码

文本

[14]

def sequence_similarity(sequence_1, sequence_2, type_="identity"):

"""

Compares two sequences using a given metric.

Parameters

----------

sequence_1, sequence_2 : str

The two sequences of strings for comparison.

type_ : str

The type of metric to compute the similarity.

The default is `identity`.

Returns

-------

float :

The similarity between the pocket sequences of the two kinases.

"""

# Replace possible unavailable residue

# noted in KLIFS with "-"

# by the symbol "*" for biotite

sequence_1 = sequence_1.replace("-", "*")

sequence_2 = sequence_2.replace("-", "*")

if len(sequence_1) != len(sequence_1):

raise ValueError(f"Mismatch in sequence lengths.")

else:

seq_array1 = np.array(list(sequence_1))

seq_array2 = np.array(list(sequence_2))

if type_ == "identity":

is_match_array = identity_score(seq_array1, seq_array2)

similarity_normed = sum(is_match_array) / len(sequence_1)

elif type_ == "substitution":

match_score_array = substitution_score(seq_array1, seq_array2)

similarity_normed = sum(match_score_array) / len(sequence_1)

else:

raise ValueError(f"Type {type_} not defined.")

return similarity_normed

代码

文本

Let's look at the sequence similarity between two kinases (see also Figure 2):

代码

文本

[15]

if DEMO:

example1 = "EGFR"

example2 = "MET"

else:

example1 = kinase_selection_df["kinase_klifs"][0]

example2 = kinase_selection_df["kinase_klifs"][1]

print("The sequences are:\n")

for key in (example1, example2):

print(f"{key:5s}: {kinase_sequences_dict[key]}")

# NBVAL_CHECK_OUTPUT

The sequences are:

EGFR : KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
MET  : EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA

代码

文本

[16]

example_seq_similarity = sequence_similarity(

kinase_sequences_dict[example1], kinase_sequences_dict[example2], "identity"

)

print(

f"Pocket sequence similarity between {example1} and {example2} kinases: "

f"{example_seq_similarity:.2f} using identity."

)

# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.

代码

文本

[17]

example_seq_similarity = sequence_similarity(

kinase_sequences_dict[example1], kinase_sequences_dict[example2], "substitution"

)

print(

f"Pocket sequence similarity between {example1} and {example2} kinases: "

f"{example_seq_similarity:.2f} using substitution."

)

# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR and MET kinases: 0.71 using substitution.

代码

文本

EGFR and MET similarity

Figure 2: Sequences and sequence similarity between the kinases EGFR and MET.

代码

文本

We can also look at self-similarity:

代码

文本

[18]

example_seq_similarity = sequence_similarity(

kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="identity"

)

print(

f"Pocket sequence similarity between {example1} itself: "

f"{example_seq_similarity:.2f} using identity."

)

# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR itself: 1.00 using identity.

代码

文本

[19]

example_seq_similarity = sequence_similarity(

kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="substitution"

)

print(

f"Pocket sequence similarity between {example1} itself: "

f"{example_seq_similarity:.2f} using substitution."

)

# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR itself: 1.00 using substitution.

代码

文本

As expected, the similarity between a kinase and itself leads to the highest possible score:

代码

文本

Visualize similarity as kinase matrix

代码

文本

[20]

def pairwise_kinase_similarities(kinase_sequence_dictionary, similarity_measure):

"""

Calculate pairwise similarities between a set of kinases.

Parameters

----------

kinase_sequence_dictionary : dict

A dictionary containing the kinase name as key

and the kinase sequence as value.

similarity_measure : str

Select similarity measure: "identity" or "substitution".

Returns

-------

pd.DataFrame

All-against-all similarities between input kinases.

"""

# Initialize matrix with 0

kinase_similarity_matrix = np.zeros(

(len(kinase_sequence_dictionary), len(kinase_sequence_dictionary))

)

# Iterate over matrix and fill with similarity values

for i, kinase_sequence1 in enumerate(kinase_sequence_dictionary.values()):

for j, kinase_sequence2 in enumerate(kinase_sequence_dictionary.values()):

kinase_similarity_matrix[i, j] = sequence_similarity(

kinase_sequence1, kinase_sequence2, similarity_measure

)

# Cast matrix to DataFrame

kinase_similarity_matrix_df = pd.DataFrame(

data=kinase_similarity_matrix,

index=kinase_sequences_dict.keys(),

columns=kinase_sequences_dict.keys(),

)

kinase_similarity_matrix_df.index.name = None

kinase_similarity_matrix_df.columns.name = None

# Check for symmetry

symmetric = (

kinase_similarity_matrix_df.values == kinase_similarity_matrix_df.values.transpose()

).all()

if not symmetric:

raise ValueError(f"Matrix is not symmetric.")

return kinase_similarity_matrix_df

代码

文本

We visualize the similarity matrices using identity and substitution:

代码

文本

Kinase similarity matrix: Identity

代码

文本

[21]

kinase_similarity_matrix_identity_df = pairwise_kinase_similarities(

kinase_sequences_dict, similarity_measure="identity"

)

kinase_similarity_matrix_identity_df

# NBVAL_CHECK_OUTPUT

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000000	0.894118	0.117647	0.470588	0.376471	0.317647	0.447059	0.458824	0.388235
ErbB2	0.894118	1.000000	0.117647	0.435294	0.400000	0.329412	0.423529	0.470588	0.400000
p110a	0.117647	0.117647	1.000000	0.152941	0.152941	0.105882	0.141176	0.105882	0.141176
KDR	0.470588	0.435294	0.152941	1.000000	0.400000	0.341176	0.435294	0.470588	0.388235
BRAF	0.376471	0.400000	0.152941	0.400000	1.000000	0.329412	0.388235	0.376471	0.376471
CDK2	0.317647	0.329412	0.105882	0.341176	0.329412	1.000000	0.376471	0.364706	0.470588
LCK	0.447059	0.423529	0.141176	0.435294	0.388235	0.376471	1.000000	0.400000	0.388235
MET	0.458824	0.470588	0.105882	0.470588	0.376471	0.364706	0.400000	1.000000	0.364706
p38a	0.388235	0.400000	0.141176	0.388235	0.376471	0.470588	0.388235	0.364706	1.000000

代码

文本

[22]

# Show matrix with background gradient

cm = sns.light_palette("green", as_cmap=True)

kinase_similarity_matrix_identity_df.style.background_gradient(cmap=cm).format("{:.3f}")

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000	0.894	0.118	0.471	0.376	0.318	0.447	0.459	0.388
ErbB2	0.894	1.000	0.118	0.435	0.400	0.329	0.424	0.471	0.400
p110a	0.118	0.118	1.000	0.153	0.153	0.106	0.141	0.106	0.141
KDR	0.471	0.435	0.153	1.000	0.400	0.341	0.435	0.471	0.388
BRAF	0.376	0.400	0.153	0.400	1.000	0.329	0.388	0.376	0.376
CDK2	0.318	0.329	0.106	0.341	0.329	1.000	0.376	0.365	0.471
LCK	0.447	0.424	0.141	0.435	0.388	0.376	1.000	0.400	0.388
MET	0.459	0.471	0.106	0.471	0.376	0.365	0.400	1.000	0.365
p38a	0.388	0.400	0.141	0.388	0.376	0.471	0.388	0.365	1.000

代码

文本

Kinase similarity matrix: Substitution

代码

文本

[23]

kinase_similarity_matrix_substitution_df = pairwise_kinase_similarities(

kinase_sequences_dict, similarity_measure="substitution"

)

kinase_similarity_matrix_substitution_df

# NBVAL_CHECK_OUTPUT

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000000	0.940963	0.427062	0.716047	0.655028	0.648447	0.711321	0.711258	0.644644
ErbB2	0.940963	1.000000	0.413638	0.702118	0.654600	0.630308	0.685075	0.697967	0.635173
p110a	0.427062	0.413638	1.000000	0.422705	0.436459	0.423994	0.451336	0.393699	0.431194
KDR	0.716047	0.702118	0.422705	1.000000	0.671268	0.653379	0.687648	0.713806	0.653444
BRAF	0.655028	0.654600	0.436459	0.671268	1.000000	0.646755	0.672933	0.638158	0.637912
CDK2	0.648447	0.630308	0.423994	0.653379	0.646755	1.000000	0.681253	0.656025	0.723093
LCK	0.711321	0.685075	0.451336	0.687648	0.672933	0.681253	1.000000	0.690879	0.662581
MET	0.711258	0.697967	0.393699	0.713806	0.638158	0.656025	0.690879	1.000000	0.629355
p38a	0.644644	0.635173	0.431194	0.653444	0.637912	0.723093	0.662581	0.629355	1.000000

代码

文本

[24]

# Show matrix with background gradient

cm = sns.light_palette("green", as_cmap=True)

kinase_similarity_matrix_substitution_df.style.background_gradient(cmap=cm).format("{:.3f}")

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000	0.941	0.427	0.716	0.655	0.648	0.711	0.711	0.645
ErbB2	0.941	1.000	0.414	0.702	0.655	0.630	0.685	0.698	0.635
p110a	0.427	0.414	1.000	0.423	0.436	0.424	0.451	0.394	0.431
KDR	0.716	0.702	0.423	1.000	0.671	0.653	0.688	0.714	0.653
BRAF	0.655	0.655	0.436	0.671	1.000	0.647	0.673	0.638	0.638
CDK2	0.648	0.630	0.424	0.653	0.647	1.000	0.681	0.656	0.723
LCK	0.711	0.685	0.451	0.688	0.673	0.681	1.000	0.691	0.663
MET	0.711	0.698	0.394	0.714	0.638	0.656	0.691	1.000	0.629
p38a	0.645	0.635	0.431	0.653	0.638	0.723	0.663	0.629	1.000

代码

文本

When we compare the matrices calculated based on the identity and substitution score, the overall pattern is similar, while the values are generally higher using the substitution score.

Note: For all downstream analysis, we will only consider the kinase similarity matrix calculated based on the substitution matrix.

代码

文本

[25]

kinase_similarity_matrix_df = kinase_similarity_matrix_substitution_df

代码

文本

Save kinase similarity matrix

代码

文本

[26]

file_name = f"kinase_similarity_matrix.csv"

kinase_similarity_matrix_df.to_csv(DATA / file_name)

代码

文本

Kinase distance matrix

代码

文本

Since all entries are between $0$ and $1$ , the similarity matrix $SM$ is mapped to a distance matrix:

代码

文本

[27]

print(

f"The values of the similarity matrix lie between: "

f"{kinase_similarity_matrix_df.min().min():.2f}"

f" and {kinase_similarity_matrix_df.max().max():.2f}"

)

# NBVAL_CHECK_OUTPUT

The values of the similarity matrix lie between: 0.39 and 1.00

代码

文本

[28]

kinase_distance_matrix_df = 1 - kinase_similarity_matrix_df

代码

文本

[29]

kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	0.000	0.059	0.573	0.284	0.345	0.352	0.289	0.289	0.355
ErbB2	0.059	0.000	0.586	0.298	0.345	0.370	0.315	0.302	0.365
p110a	0.573	0.586	0.000	0.577	0.564	0.576	0.549	0.606	0.569
KDR	0.284	0.298	0.577	0.000	0.329	0.347	0.312	0.286	0.347
BRAF	0.345	0.345	0.564	0.329	0.000	0.353	0.327	0.362	0.362
CDK2	0.352	0.370	0.576	0.347	0.353	0.000	0.319	0.344	0.277
LCK	0.289	0.315	0.549	0.312	0.327	0.319	0.000	0.309	0.337
MET	0.289	0.302	0.606	0.286	0.362	0.344	0.309	0.000	0.371
p38a	0.355	0.365	0.569	0.347	0.362	0.277	0.337	0.371	0.000

代码

文本

Save kinase distance matrix

代码

文本

[30]

file_name = f"kinase_distance_matrix.csv"

kinase_distance_matrix_df.to_csv(DATA / file_name)

代码

文本

Discussion

In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is set on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: 1. the identity, which treats all amino acids uniformly, and 2. the substitution, which takes into account the rate of change of residues over evolutionary time.

The kinase similarity matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.

代码

文本

Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Kinase similarity: Sequence

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

Talia B. Kimber, 2021, Volkamer lab, Charité
Dominique Sydow, 2021, Volkamer lab, Charité
Andrea Volkamer, 2021, Volkamer lab, Charité

代码

文本