Dataset: protein-ligand-fep(v2)

Can't Tell IC50, EC50, Ki, Kd, Km, k_on, k_off Apart? Protein-Small-Molecule Experimental Data Explained and Cleaned (Part 1)

©️ Copyright 2023 @ Authors
Author: 昌珺涵 📨
Related links: the theory sections draw on the Zhihu account ScienceSnail; the analysis code is adapted from Schrodinger's GitHub.
License: this work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Date: 2023-11-11
Quick start: click the Connect button at the top, select the bohrium-notebook:2023-05-31 image and any machine configuration to get started.


Preface

Predicting the binding activity of candidate small-molecule drugs is one of the core tasks of small-molecule CADD. To predict binding activity as quickly and accurately as possible, a whole series of methods has been developed, from simulation-based FEP and MM-GB/PBSA to docking and AI-based prediction, and with the arrival of the AI for Science era, prediction accuracy keeps improving.

At the same time, it is worth taking a careful look at the gold standard used for comparison and as training data: the experimental measurements themselves. After reading this notebook, you should be able to answer the following questions with confidence:

  • How do inhibition constants from biochemical enzyme-activity assays differ from those obtained with biophysical techniques (e.g., ITC, SPR)? Can they be treated as equivalent? Which kind of experimental data matches FEP calculations better?
  • Does the commonly used IC50 really reflect an inhibitor's true inhibitory potency?
  • How can biochemical assay data from different batches be converted to a common scale?
  • When feeding large-scale activity data to machine learning, what are appropriate ways to bridge the gaps between experiments?

We will start from the definitions of these common parameters and then discuss the differences between those that are easy to confuse. Note: unless otherwise stated, whenever inhibitors are involved, this article only discusses the common case of competitive inhibitors.


1. Experiment Types and the Definitions of Their Activity Data

Before the explanations, here is a diagram of the enzymatic reaction scheme for reference in what follows: (figure: enzyme reaction scheme)

Notation: E = enzyme; S = substrate; ES = enzyme-substrate intermediate complex; P = product;

I, I' = inhibitors; EI, EI' = enzyme-inhibitor complexes;

$k_x$ = reaction rate constants; $[X]$ = concentration of species X

1.1 Biophysical experiments (also called binding assays)

Biophysical experiments measure the change in a physical signal when the protein (enzyme) binds an inhibitor, as in reaction (b) in the figure. Biophysical experiments are not affected by the enzymatic reaction: only the enzyme and the inhibitor are present in the system, so they probe inhibitor binding in a "purer" way.

(1) $K_d = \dfrac{[E][I']}{[EI']}$

$K_d$, the dissociation constant, reflects a compound's affinity for its target: the smaller the value, the stronger the affinity.

  • When the bound ligand is an inhibitor, this chemical equilibrium constant is also written $K_i$; the two are equivalent.

From this formula it follows that $K_d$ equals the free-inhibitor concentration at which 50% of the enzyme E is bound by the inhibitor I'.

  • The association constant $K_a = 1/K_d$ behaves oppositely to $K_d$: the larger the value, the stronger the affinity.
  • The binding free energy $\Delta G = RT \ln K_d$ can be compared with values computed by FEP and similar methods.

(2) $k_{\text{on}}$, $k_{\text{off}}$, related by $K_d = \dfrac{k_{\text{off}}}{k_{\text{on}}}$

  • $k_{\text{on}}$, the association rate constant, describes how quickly the molecules bind; its unit is $\mathrm{M^{-1}\,s^{-1}}$.
  • $k_{\text{off}}$, the dissociation rate constant, describes how quickly the complex dissociates; its unit is $\mathrm{s^{-1}}$.

Common biophysical experiments include isothermal titration calorimetry (Isothermal Titration Calorimetry, ITC) and surface plasmon resonance (Surface Plasmon Resonance, SPR).
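To make these definitions concrete, here is a minimal Python sketch (the helper names are ours, not part of the analysis code below) that turns measured rate constants into a $K_d$ and a $K_d$ into a binding free energy comparable with FEP output:

```python
import numpy as np

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def kd_from_rates(k_on: float, k_off: float) -> float:
    """K_d = k_off / k_on; k_on in M^-1 s^-1, k_off in s^-1, so K_d is in M."""
    return k_off / k_on

def dg_from_kd(kd: float, temperature: float = 298.15) -> float:
    """Binding free energy dG = RT ln(K_d) in kcal/mol (more negative = tighter binder)."""
    return R_KCAL * temperature * np.log(kd)

# Example: k_on = 1e6 M^-1 s^-1 and k_off = 1e-2 s^-1 give K_d = 10 nM,
# i.e. roughly -10.9 kcal/mol at room temperature.
kd = kd_from_rates(1e6, 1e-2)
print(f"K_d = {kd:.1e} M, dG = {dg_from_kd(kd):.1f} kcal/mol")
```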


1.2 Enzyme biochemistry experiments (also called functional assays): enzyme kinetics analysis

Enzyme biochemistry experiments measure how the enzyme's catalytic activity is affected when it binds an inhibitor, as in reaction (a) in the figure.

First assume that no inhibitor I is present. Once catalysis reaches steady state, the concentration of the unstable intermediate ES is constant; applying the steady-state approximation to the kinetic differential equations gives:

(3) $\dfrac{d[ES]}{dt} = k_1[E][S] - (k_{-1} + k_2)[ES] = 0$

  • From this the Michaelis constant can be defined: $K_m = \dfrac{k_{-1} + k_2}{k_1}$

  • Reaction rate: $v = \dfrac{v_{\max}[S]}{K_m + [S]}$, with $v_{\max} = k_2[E]_0$

The Michaelis constant $K_m$ (Michaelis-Menten constant) is a characteristic parameter of the enzyme itself; its physical meaning is the substrate concentration $[S]$ at which the reaction proceeds at half its maximum rate. $K_m$ depends only on the nature of the enzyme, not on its concentration, but it does vary with the substrate used, as well as with temperature, ionic strength, and pH.
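As a quick sanity check of the rate law, a small sketch (the function name is ours):

```python
def michaelis_menten_rate(s: float, vmax: float, km: float) -> float:
    """Initial rate v = vmax * [S] / (Km + [S])."""
    return vmax * s / (km + s)

# At [S] = Km the rate is exactly half of vmax:
print(michaelis_menten_rate(s=2.0, vmax=10.0, km=2.0))  # -> 5.0
```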


1.3 Enzyme biochemistry experiments: kinetics in the presence of an inhibitor


So, how do the equations above change when an inhibitor I is present?

Taking a competitive inhibitor as an example and repeating the same derivation with the inhibitor-containing enzyme mass balance $[E]_0 = [E] + [ES] + [EI]$ (where $K_i = \frac{[E][I]}{[EI]}$), we obtain

$v = \dfrac{v_{\max}[S]}{K_m\left(1 + \dfrac{[I]}{K_i}\right) + [S]}$

(4) $IC_{50}$, $EC_{50}$ (half maximal inhibitory/effective concentration)

Broadly speaking, $IC_{50}$ is the concentration of a drug or inhibitor required to inhibit a given biological process (or a component of that process, such as an enzyme, receptor, or cell) by half. In pharmacology it is used to characterize the potency of an antagonist in vitro. $EC_{50}$ is the concentration of a drug, antibody, or toxin that produces 50% of the maximal biological effect after a specified exposure time. Besides characterizing the potency of an agonist in vitro, it can also denote the plasma concentration required to reach half of the maximal effect in vivo. In many publications, $IC_{50}$/$EC_{50}$ is also used to characterize a compound's potency at the cellular level (both agonism and antagonism).

(The definitions above are adapted in part from references [1-2].)

The difference between the two ($IC_{50}$ and $K_i$) is evident from their definitions: although both characterize a compound's affinity for its target, $IC_{50}$ depends on the enzyme and substrate concentrations used in the experiment, whereas $K_i$ does not. Many medicinal-chemistry papers report $IC_{50}$ values without stating the enzyme and substrate concentrations used in their assays; this hinders comparison of inhibitory potency across publications, because $IC_{50}$ can be tuned by adjusting those concentrations. In this sense $K_i$ is the more rigorous metric, so why do most papers not report it? Mainly because measuring $K_i$ is more laborious. An $IC_{50}$ can be obtained simply by titrating the inhibitor across a concentration gradient and fitting the dose-response curve, whereas determining $K_i$ requires first measuring the enzyme's $K_m$ under the given conditions and then making double-reciprocal plots at several substrate concentrations for a given inhibitor concentration.
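To illustrate the double-reciprocal (Lineweaver-Burk) step, a sketch on synthetic data (all numbers below are made up):

```python
import numpy as np

# Hypothetical initial-rate data: substrate concentrations (M) and rates (a.u.)
s = np.array([2e-6, 5e-6, 1e-5, 2e-5, 5e-5])
v = 10.0 * s / (4e-6 + s)  # synthetic data with vmax = 10, Km = 4 uM

# Lineweaver-Burk: 1/v = (Km/vmax) * (1/[S]) + 1/vmax, a straight line
slope, intercept = np.polyfit(1.0 / s, 1.0 / v, 1)
vmax = 1.0 / intercept
km = slope * vmax
print(f"vmax = {vmax:.2f}, Km = {km:.2e} M")  # recovers 10 and 4e-6
```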

By definition, $IC_{50}$ can be obtained by solving $v([I]{=}IC_{50}) = \tfrac{1}{2}\,v([I]{=}0)$ for the inhibitor concentration. For the different inhibition modes (competitive, uncompetitive, noncompetitive, mixed), the relations between $IC_{50}$ and $K_i$ are summarized in the table below (all expressions can be generalized quickly using the side-reaction-coefficient trick):

| Inhibition mechanism | (Initial) rate expression | $IC_{50}$ expression | How to compute from the double-reciprocal plot |
|---|---|---|---|
| Competitive inhibitor: binds free enzyme E | $v=\dfrac{v_{\max}[S]}{K_m\left(1+\frac{[I]}{K_i}\right)+[S]}$ | $IC_{50}=K_i\left(1+\frac{[S]}{K_m}\right)$ | $K_i$ from the change in slope |
| Uncompetitive inhibitor: binds the enzyme-substrate complex ES | $v=\dfrac{v_{\max}[S]}{K_m+[S]\left(1+\frac{[I]}{K_i'}\right)}$ | $IC_{50}=K_i'\left(1+\frac{K_m}{[S]}\right)$ | $K_i'$ from the change in intercept |
| Noncompetitive inhibitor: binds E and ES with equal strength | $v=\dfrac{v_{\max}[S]}{(K_m+[S])\left(1+\frac{[I]}{K_i}\right)}$ | $IC_{50}=K_i$ | (requires $K_m$ measured without inhibitor) |
| Mixed competitive inhibitor: binds E and ES with different strengths | $v=\dfrac{v_{\max}[S]}{K_m\left(1+\frac{[I]}{K_i}\right)+[S]\left(1+\frac{[I]}{K_i'}\right)}$ | $IC_{50}=\dfrac{K_m+[S]}{K_m/K_i+[S]/K_i'}$ | (requires $K_m$ measured without inhibitor; obtain $K_i'$ from the intercept) |
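Section 3 will rely on exactly these conversions, so here is a minimal sketch of the two most common ones, following Cheng-Prusoff [3] (the function names are ours):

```python
def ki_from_ic50_competitive(ic50: float, s: float, km: float) -> float:
    """Competitive inhibitor (Cheng-Prusoff): Ki = IC50 / (1 + [S]/Km)."""
    return ic50 / (1.0 + s / km)

def ki_from_ic50_uncompetitive(ic50: float, s: float, km: float) -> float:
    """Uncompetitive inhibitor: Ki' = IC50 / (1 + Km/[S])."""
    return ic50 / (1.0 + km / s)

# Example: IC50 = 100 nM measured at [S] = Km implies Ki = 50 nM (competitive)
print(ki_from_ic50_competitive(100e-9, s=1e-6, km=1e-6))  # -> 5e-08
```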

References:

[1] Wikipedia: https://www.wikipedia.org/

[2] 王镜岩 et al., Biochemistry (《生物化学》), Vol. I, Higher Education Press, Chapter 9.

[3] Cheng Y, Prusoff WH. Biochem Pharmacol. 1973, 22(23): 3099-3108.


2. Comparing Activities Measured by Different Experiments

Reference [3] collects compound series, retrieved via BindingDB, for which two or more types of experimental data are available for comparison. Let's plot them.

[1]
wdir = "/bohr/protein-ligand-fep-qxoj/v2/experimental_survey_data"

binding_comparisons = [
    f'{wdir}/aaron2010_spr_itc.csv',
    f'{wdir}/hang2009_hcvpoly_bk_spr_fluor.csv',
    f'{wdir}/jecklin2009_ms_itc.csv',
    f'{wdir}/jecklin2009_ms_spr.csv',
    f'{wdir}/jecklin2009_spr_itc.csv',
    f'{wdir}/mandine2001_sh2_spa_spr.csv',
    f'{wdir}/mason2012_spr_lan.csv',
    f'{wdir}/murphy2006_binding.csv',
    f'{wdir}/murthy2007_itc_spr.csv',
    f'{wdir}/navratilova2007_caII_itc_spr.csv',
    f'{wdir}/newman2012_spr_itc.csv',
    f'{wdir}/peterson2018_itc_fp.csv',
    f'{wdir}/rogez2013_bovine_itc_tsa.csv',
    f'{wdir}/rogez2013_bovine_spr_itc.csv',
    f'{wdir}/rogez2013_bovine_spr_tsa.csv',
    f'{wdir}/rogez2013_human_itc_tsa.csv',
    f'{wdir}/rogez2013_human_spr_itc.csv',
    f'{wdir}/rogez2013_human_spr_tsa.csv',
    f'{wdir}/schnapp2016_dppiv_itc_spr.csv',
    f'{wdir}/ycas2020_bptf_labeled_spr_alphascreen.csv',
    f'{wdir}/ycas2020_bptf_nmr_alphascreen.csv',
    f'{wdir}/ycas2020_bptf_nmr_labeled_spr.csv',
    f'{wdir}/ycas2020_bptf_nmr_spr.csv',
    f'{wdir}/ycas2020_bptf_spr_alphascreen.csv'
]

binding_vs_functional_comparisons = [
    f'{wdir}/baum2009_thrombin_ki_itc.csv',
    f'{wdir}/hang2009_hcvpoly_bk_ic50_fluor.csv',
    f'{wdir}/hang2009_hcvpoly_con1_ic50_fluor.csv',
    f'{wdir}/kung2011_hsp90_spa_itc.csv',
    f'{wdir}/li2016_dppiv_spr_ic50.csv',
    f'{wdir}/mason2012_lan_ic50.csv',
    f'{wdir}/mason2012_spr_ic50.csv',
    f'{wdir}/nikolovska2004_xiap_ki_kd.csv',
    f'{wdir}/crawford2016_binding_BRD41.csv',
    f'{wdir}/crawford2016_binding_BRD42.csv',
    f'{wdir}/crawford2016_binding_BRD9.csv',
    f'{wdir}/crawford2016_binding_BRPF1.csv',
    f'{wdir}/crawford2016_binding_CECR2.csv',
    f'{wdir}/crawford2016_binding_CREBBP.csv',
    f'{wdir}/crawford2016_binding_TAF12.csv',
    f'{wdir}/patil2018_ic50_spr_nottopbottom.csv',
    f'{wdir}/schnapp2016_dppiv_ic50_spr.csv',
    f'{wdir}/schnapp2016_dppiv_itc_ic50.csv',
    f'{wdir}/schindler2020_functional_spr.csv'
]

functional_comparisons = [
    f'{wdir}/chen2013_inhibition.csv',
    f'{wdir}/jia2006_cot_inhibition_nottopbottom.csv',
    f'{wdir}/katz2003_upa_competition.csv',
    f'{wdir}/moonshot2020_covid_protease_inhibition_nottopbottom.csv'
]
all_comparitive_files = binding_comparisons + binding_vs_functional_comparisons + functional_comparisons
[ ]
# Plotting section
! jupyter labextension install jupyterlab-plotly
! pip install kaleido
import numpy as np
import pandas as pd

def corr_plot_plotly(df: pd.DataFrame, x_col: str, y_col: str, error_x: str = None, error_y: str = None, name: str = None,
                     color: str = None, category: str = None, custom_data: list = None,
                     title: str = None, xlabel: str = None, ylabel: str = None, outputfile: str = "corr_plot.png",
                     template="plotly", size=12, width=1024, height=928):
    try:
        import plotly.express as px
        import plotly.graph_objects as go
        import plotly.io as pio

        my_template = pio.templates[template]
        new_sequence = ['circle', 'square', 'diamond', 'cross', 'x', 'triangle-left', 'triangle-right', 'octagon',
                        'pentagon', 'hexagon', 'star', 'hexagram', 'star-triangle-up', 'star-triangle-down',
                        'star-square', 'star-diamond']

        # Give each category its own marker symbol with a thin white outline
        my_template.data.scatter = [
            go.Scatter(marker=dict(symbol=s,
                                   size=size,
                                   opacity=0.8,
                                   line=dict(color='white', width=size / 8)
                                   )
                       ) for s in new_sequence
        ]
    except ImportError:
        print("Please install plotly to use this function.")
        return None

    df2 = df.copy()
    df2 = df2.fillna(0)  # fillna returns a new DataFrame; the result must be assigned

    fig = px.scatter(df2, x=x_col, y=y_col, error_x=error_x, error_y=error_y, hover_name=name,
                     color=color, symbol=category, custom_data=custom_data,
                     title=title,
                     width=width, height=height, template=my_template)
    for i, figdata in enumerate(fig.data):
        website = "https://static01.dp.tech/openfiles.mlops.dp.tech/v1/projects/launching/596677e13dd54ef2b939b335f74f11ba/viewer.html?smiles=CC(=O)Oc1ccccc1C(=O)O"
        # figdata.hovertemplate = figdata.hovertemplate.replace("<br><br>", f"<br><iframe width=100px height=100px src=\"{website}\"></iframe><br>")
        # figdata.hovertemplate = '<br>'.join(['%s: %%{customdata[%d]}' % (x_col, ii) for ii in range(len(custom_data))])

    # Add shaded bands at +/-1 and +/-2 units around the diagonal
    x_shadow = [df2[[x_col, y_col]].min().min() - 1, df2[[x_col, y_col]].max().max() + 1]
    y_upper_2 = [val + 2 for val in x_shadow]
    y_lower_2 = [val - 2 for val in x_shadow]
    y_upper_1 = [val + 1 for val in x_shadow]
    y_lower_1 = [val - 1 for val in x_shadow]

    shadow_trace_2 = go.Scatter(
        x=x_shadow + x_shadow[::-1],
        y=y_upper_2 + y_lower_2[::-1],
        fill='toself',
        fillcolor='rgba(0, 0, 0, 0.1)',
        line=dict(color='rgba(0, 0, 255, 0)'),
        showlegend=False
    )

    shadow_trace_1 = go.Scatter(
        x=x_shadow + x_shadow[::-1],
        y=y_upper_1 + y_lower_1[::-1],
        fill='toself',
        fillcolor='rgba(0, 0, 0, 0.2)',
        line=dict(color='rgba(0, 0, 255, 0)'),
        showlegend=False
    )

    layout = go.Layout(
        title=fig.layout["title"].to_plotly_json(),
        xaxis=dict(range=x_shadow, autorange=False, tickformat=".2f", **fig.layout["xaxis"].to_plotly_json()),
        yaxis=dict(range=x_shadow, autorange=False, tickformat=".2f", **fig.layout["yaxis"].to_plotly_json()),
        width=width,
        height=height,
        hovermode="x unified",
        hoverdistance=1
    )
    fig2 = go.Figure(data=[shadow_trace_2, shadow_trace_1, *fig.data], layout=layout)

    if outputfile:
        if outputfile.endswith(".png"):
            open(outputfile, 'wb').write(fig2.to_image("png"))

    return fig2
[3]
# Statistics calculation section
from scipy import stats
from itertools import combinations

def read_exp_csv(filename):
    df = pd.read_csv(filename)
    dgs = [df[d] for d in df]
    if len(dgs) != 2:
        raise Exception(f'Only 2 columns are expected in CSV file. File {filename} contains {len(dgs)} columns.')
    exp1 = df.columns[0].split("-")[0].rstrip()
    exp2 = df.columns[1].split("-")[0].rstrip()
    exp_vs = f"{exp1} vs {exp2}"

    return np.array(dgs[0]), np.array(dgs[1]), exp_vs


def weighted_mean(num_per_set, value_per_set):
    """
    Calculate the weighted mean for a value in the experimental survey or FEP benchmark.

    Parameters
    ----------
    num_per_set: np.array
        The weighting attached to a value
    value_per_set: np.array
        The array that will have its weighted mean calculated

    Returns
    -------
    the weighted mean: float
    """
    return np.sum(num_per_set * value_per_set) / np.sum(num_per_set)


def weighted_rmsd(num_per_set, rmsd_per_set):
    """
    Calculate the overall weighted root-mean-square of a set of root-mean-squares (RMS). The calculation first
    calculates the weighted mean-square before returning the square root.

    Parameters
    ----------
    num_per_set: np.array
        The weighting applied to each RMS value.
    rmsd_per_set: np.array
        The RMS values that will be aggregated.

    Returns
    -------
    the weighted RMS: float
    """
    return np.sqrt(np.sum(num_per_set * rmsd_per_set**2) / np.sum(num_per_set))


def get_bootstrap_weighted_value(num_set, value_set, nboots=10000):
    """
    Return the bootstrap mean of an array along with its uncertainty.

    Parameters
    ----------
    num_set: np.array
        The weighting applied to each value in the value_set
    value_set: np.array
        The array that will have its bootstrap mean calculated.
    nboots: int
        The number of bootstrap samples to take.

    Returns
    -------
    the bootstrap mean: float
    the bootstrap standard error: float
    the bootstrap 2.5% confidence limit: float
    the bootstrap 97.5% confidence limit: float
    """
    value_samples = np.zeros(nboots)
    for b in range(nboots):
        inds = np.random.choice(len(value_set), len(value_set))
        value_samples[b] = weighted_mean(num_set[inds], value_set[inds])

    return value_samples.mean(), value_samples.std(), np.percentile(value_samples, 2.5), np.percentile(value_samples, 97.5)


def get_bootstrap_weighted_rmsd(num_set, rmsd_set, nboots=10000):
    """
    Calculate the bootstrap estimate of the overall weighted root-mean-square of a set of root-mean-squares (RMS).

    Parameters
    ----------
    num_set: np.array
        The weighting applied to each value in the rmsd_set
    rmsd_set: np.array
        The array of root-mean-squares that will have their bootstrap mean calculated.
    nboots: int
        The number of bootstrap samples to take.

    Returns
    -------
    the bootstrap mean: float
    the bootstrap standard error: float
    the bootstrap 2.5% confidence limit: float
    the bootstrap 97.5% confidence limit: float
    """
    rmsd_samples = np.zeros(nboots)
    for b in range(nboots):
        inds = np.random.choice(len(rmsd_set), len(rmsd_set))
        rmsd_samples[b] = weighted_rmsd(num_set[inds], rmsd_set[inds])

    return rmsd_samples.mean(), rmsd_samples.std(), np.percentile(rmsd_samples, 2.5), np.percentile(rmsd_samples, 97.5)


def get_absolute_stats(dg_set1, dg_set2, verbose=True):
    """
    Get the square of the Pearson correlation coefficient and Kendall's tau.
    """
    r2 = np.corrcoef(dg_set1, dg_set2)[0, 1]**2
    t = stats.kendalltau(dg_set1, dg_set2).correlation
    if verbose:
        print('R-squared = {}'.format(np.round(r2, 2)))
        print("Kendall's tau = {}".format(np.round(t, 2)))
    # Also throw in the absolute deviation
    abs_diff = np.abs(dg_set1 - dg_set2)
    rmsd = np.sqrt(np.mean(abs_diff**2))
    if verbose:
        print('Absolute MAE = {} kcal/mol'.format(np.round(abs_diff.mean(), 2)))
        print('Absolute RMSE = {} kcal/mol'.format(np.round(rmsd, 2)))
    return rmsd, abs_diff.mean(), r2, t


def get_pairwise_diffs(data1, data2, inds=None, verbose=True):
    """
    Calculate the pairwise differences between two data series.

    Parameters
    ----------
    data1: list-like of floats
        The data series for one variable.
    data2: list-like of floats
        The data series for the second variable.
    inds: list-like of ints
        The indices of the elements in the data series that will be used to calculate the pairwise error. By default,
        all indices are used.
    verbose: bool
        Whether to print out a summary of the pairwise errors.

    Returns
    -------
    diffs: numpy.ndarray
        The array of all pairwise differences. If the data series have a length of N, then the length of diffs will be
        N * (N - 1) / 2.
    """
    if len(data1) != len(data2):
        raise Exception('Length of inputs do not match')
    if inds is None:
        inds = np.arange(len(data1))

    pairwise_diffs1 = np.array([data1[i] - data1[j] for i, j in combinations(inds, 2)])
    pairwise_diffs2 = np.array([data2[i] - data2[j] for i, j in combinations(inds, 2)])
    diffs = pairwise_diffs2 - pairwise_diffs1

    if verbose:
        rmsd = np.sqrt(np.mean(diffs ** 2))
        print(len(inds), 'total compounds')
        print(len(pairwise_diffs1), 'pairwise differences to compare')
        print('Pairwise MUE = {:.2f} kcal/mol'.format(np.mean(np.abs(diffs))))
        print('Pairwise RMSD = {:.2f} kcal/mol'.format(rmsd))

    return diffs


def parse_experimental_data(files, notlist=()):
    """
    Collect the experimental errors from a set of csv files that contain all the comparative data.

    Parameters
    ----------
    files: list-like
        The paths to all the files that will have their error aggregated.
    notlist: list-like
        The names of the csv files that will be omitted from the analysis.

    Returns
    -------
    results: dict
        Contains the arrays of the data RMSEs, MUEs, and correlation statistics.
    """
    results = {'entries': [], 'number': [], 'Pairwise RMSE': [], 'Pairwise MUE': [], 'Absolute RMSE': [],
               'Absolute MUE': [], 'R-squared': [], 'Kendall tau': []}

    pairwise_diffs = []
    data = []
    for f in files:
        entry = f.split('/')[-1].split('.')[0]
        if entry in notlist:
            # print('Skipping {}'.format(entry))
            pass
        else:
            # Collect the aggregate stats
            results['entries'].append(entry)
            # Correlation stats
            dg1, dg2, exp_vs = read_exp_csv(f)
            lit = entry.split("-")[0]
            n = len(dg1)
            df = pd.DataFrame({"Ref.": [lit] * n, "Experiment": [exp_vs] * n, "dG1": dg1, "dG2": dg2})
            data.append(df)
            rmsd, mae, c1, c2 = get_absolute_stats(dg1, dg2, verbose=False)
            results['R-squared'].append(c1)
            results['Kendall tau'].append(c2)
            results['number'].append(len(dg1))
            # Absolute errors
            results['Absolute RMSE'].append(np.sqrt(np.mean((dg1 - dg2)**2)))
            results['Absolute MUE'].append(np.mean(np.abs(dg1 - dg2)))
            # Pairwise errors
            diffs = get_pairwise_diffs(dg1, dg2, verbose=False)
            results['Pairwise RMSE'].append(np.sqrt(np.mean(diffs**2)))
            results['Pairwise MUE'].append(np.mean(np.abs(diffs)))

            # Store the pairwise differences
            pairwise_diffs.extend(diffs)
    data_all = pd.concat(data).reset_index(drop=True)

    for key in results:
        results[key] = np.array(results[key])

    return results, np.array(pairwise_diffs), data_all


def summarize_experimental_error(files, notlist=()):
    """
    Calculate and print the summary statistics from the set of csv files that contain the experimental comparison data.

    Parameters
    ----------
    files: list-like
        The paths to all the files that will have their error aggregated.
    notlist: list-like
        The names of the csv files that will be omitted from the analysis.

    Returns
    -------
    fig: plotly Figure
    stat: pd.DataFrame
    """
    results, diffs, data_all = parse_experimental_data(files, notlist)

    print('Total number of comparison data points (including repeated ligands) =', np.sum(results['number']))
    print()

    results_summary = {}
    for key in results:
        unit = " kcal/mol" if key not in ["R-squared", "Kendall tau"] else ""
        if key == 'number' or key == 'entries':
            if key == "number":
                results_summary[key] = [sum(results['number']), sum(results['number'])]
        elif 'RMSE' in key:
            boot_mean, boot_std, boot_lower, boot_upper = get_bootstrap_weighted_rmsd(results['number'], results[key])
            mean = weighted_rmsd(results['number'], results[key])
            # print('Weighted {} = {:.2}{}'.format(key, mean, unit))
            # print('Weighted bootstrap {} = {:.2} [{:.2f}, {:.2f}]{}\n'.format(key, boot_mean, boot_lower, boot_upper, unit))
            results_summary[key] = ["{:.2}".format(mean),
                                    "{:.2} [{:.2f}, {:.2f}]".format(boot_mean, boot_lower, boot_upper)]
        else:
            boot_mean, boot_std, boot_lower, boot_upper = get_bootstrap_weighted_value(results['number'], results[key])
            mean = weighted_mean(results['number'], results[key])
            # print('Weighted {} = {:.2}{}'.format(key, mean, unit))
            # print('Weighted bootstrap {} = {:.2} [{:.2f}, {:.2f}]{}\n'.format(key, boot_mean, boot_lower, boot_upper, unit))
            results_summary[key] = ["{:.2}".format(mean),
                                    "{:.2} [{:.2f}, {:.2f}]".format(boot_mean, boot_lower, boot_upper)]
    stat = pd.DataFrame(results_summary, index=["Direct", "Bootstrap"])
    fig = corr_plot_plotly(data_all, x_col="dG1", y_col="dG2", name="Experiment", category="Experiment",
                           color="Ref.", outputfile=None)
    return fig, stat
[16]
print('The overall experimental error in the survey')
print('--------------------------------------------')
fig, stat = summarize_experimental_error(all_comparitive_files)
fig.show()
stat
The overall experimental error in the survey
--------------------------------------------
/opt/conda/lib/python3.8/site-packages/numpy/lib/function_base.py:2829, 2830: RuntimeWarning: invalid value encountered in true_divide

Total number of comparison data points (including repeated ligands) = 643

(interactive scatter plot: dG1 vs dG2 for all 47 comparison datasets, with +/-1 and +/-2 kcal/mol bands)

|           | number | Pairwise RMSE     | Pairwise MUE      | Absolute RMSE    | Absolute MUE      | R-squared      | Kendall tau    |
|-----------|--------|-------------------|-------------------|------------------|-------------------|----------------|----------------|
| Direct    | 643    | 0.91              | 0.67              | 0.76             | 0.53              | nan            | nan            |
| Bootstrap | 643    | 0.94 [0.80, 1.16] | 0.69 [0.60, 0.86] | 0.8 [0.64, 1.06] | 0.57 [0.45, 0.78] | nan [nan, nan] | nan [nan, nan] |
[5]
print('Biophysical vs biophysical error')
print('--------------------------------')
print(f'Number of comparisons = {len(binding_comparisons)}')
fig, stat = summarize_experimental_error(binding_comparisons)
fig.show()
stat
Biophysical vs biophysical error
--------------------------------
Number of comparisons = 24
Total number of comparison data points (including repeated ligands) = 114

(interactive scatter plot: dG1 vs dG2 for the 24 biophysical-vs-biophysical datasets)

|           | number | Pairwise RMSE    | Pairwise MUE      | Absolute RMSE    | Absolute MUE      | R-squared         | Kendall tau       |
|-----------|--------|------------------|-------------------|------------------|-------------------|-------------------|-------------------|
| Direct    | 114    | 1.2              | 0.82              | 1.0              | 0.75              | 0.72              | 0.64              |
| Bootstrap | 114    | 1.1 [0.82, 1.43] | 0.82 [0.60, 1.04] | 1.0 [0.74, 1.28] | 0.74 [0.53, 0.95] | 0.72 [0.57, 0.86] | 0.65 [0.53, 0.77] |
[6]
print('Biophysical vs biochemical error')
print('---------------------------------')
print(f'Number of comparisons = {len(binding_vs_functional_comparisons)}')
fig, stat = summarize_experimental_error(binding_vs_functional_comparisons)
fig.show()
stat
Biophysical vs biochemical error
---------------------------------
Number of comparisons = 19
Total number of comparison data points (including repeated ligands) = 150
/opt/conda/lib/python3.8/site-packages/numpy/lib/function_base.py:2829, 2830: RuntimeWarning: invalid value encountered in true_divide

(interactive scatter plot: dG1 vs dG2 for the 19 biophysical-vs-biochemical datasets)

|           | number | Pairwise RMSE    | Pairwise MUE     | Absolute RMSE     | Absolute MUE      | R-squared      | Kendall tau    |
|-----------|--------|------------------|------------------|-------------------|-------------------|----------------|----------------|
| Direct    | 150    | 1.1              | 0.81             | 0.97              | 0.73              | nan            | nan            |
| Bootstrap | 150    | 1.0 [0.85, 1.16] | 0.8 [0.65, 0.90] | 0.96 [0.76, 1.15] | 0.72 [0.58, 0.88] | nan [nan, nan] | nan [nan, nan] |
[7]
print('Biochemical vs biochemical error')
print('---------------------------------')
print(f'Number of comparisons = {len(functional_comparisons)}')
fig, stat = summarize_experimental_error(functional_comparisons)
fig.show()
stat
Biochemical vs biochemical error
---------------------------------
Number of comparisons = 4
Total number of comparison data points (including repeated ligands) = 379

(interactive scatter plot: dG1 vs dG2 for the 4 biochemical-vs-biochemical datasets)

|           | number | Pairwise RMSE     | Pairwise MUE      | Absolute RMSE     | Absolute MUE      | R-squared         | Kendall tau       |
|-----------|--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| Direct    | 379    | 0.75              | 0.57              | 0.54              | 0.39              | 0.77              | 0.71              |
| Bootstrap | 379    | 0.67 [0.31, 0.79] | 0.51 [0.25, 0.60] | 0.49 [0.35, 0.57] | 0.37 [0.25, 0.41] | 0.81 [0.75, 1.00] | 0.75 [0.69, 1.00] |

As the tables show, agreement between pairs of biophysical experiments, and between pairs of biochemical experiments, is better than agreement between biophysical and biochemical experiments.

We can therefore take the random experimental error of absolute binding free energies ($\Delta G$) to be within about 0.75 kcal/mol, and that of relative binding free energies ($\Delta\Delta G$, the pairwise differences) to be within about 0.91 kcal/mol.


3. Recommendations for Data Use and Cleaning in Practice

BindingDB is currently the largest binding-activity dataset. Its Validation Set collects compound series grouped by molecular similarity, which makes it well suited for comparing relative activity differences.

However, the Validation Set contains many chemical series stitched together from multiple publications; because the experiment types and conditions differ between publications, such cross-publication values are in fact not comparable.

(figure)

Therefore, if you want to keep chemical series assembled from multiple publications rather than splitting them into single-publication series, you need to convert the values using the table in section 1.3. An enzyme's Michaelis constant can be obtained by querying the BRENDA enzyme database.

To extract binding-affinity assay protocols from a large body of literature, large language models can be used.

(figure)
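As a sketch of the extraction step, one possible prompt template (the wording and the example methods text below are made up; how you send the prompt to an LLM, via an API or a web UI, is up to you):

```python
# A hypothetical prompt template for extracting assay conditions from a paper's
# methods section with an LLM.
PROMPT = """Extract the following fields from the methods section below, as JSON:
measurement type (IC50/EC50/Ki/Kd), enzyme concentration, substrate and its
concentration, Km if stated, temperature, pH, buffer.

Methods section:
{methods_text}
"""

# Made-up example input, for illustration only
methods_text = ("Inhibition was assayed at 25 C, pH 7.4, with 10 nM enzyme "
                "and 50 uM substrate (Km = 20 uM).")
print(PROMPT.format(methods_text=methods_text))
```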


From another angle: since calibrating across experiments is so difficult, we can instead use only within-experiment relative values as the comparison and training targets. That is the topic of our next installment!
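As a small preview, a sketch of that idea on a made-up activity table (the column names and the one-reference-compound-per-assay convention are hypothetical):

```python
import pandas as pd

# Hypothetical activity table: one row per measurement
df = pd.DataFrame({
    "assay_id": ["A", "A", "A", "B", "B"],
    "compound": ["ref", "c1", "c2", "ref", "c3"],
    "dG":       [-9.0, -9.8, -8.5, -10.2, -10.9],  # kcal/mol
})

# Relative dG within each assay: subtract that assay's reference compound,
# so constant inter-assay offsets cancel out.
ref_dg = df[df["compound"] == "ref"].set_index("assay_id")["dG"]
df["ddG"] = df["dG"] - df["assay_id"].map(ref_dg)
print(df)
```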
