AI4S Cup Enzyme Function and Mutant Sequence Prediction: Tool Introduction
©️ Copyright 2024 @ Authors
Author: zhangjun@dp.tech📨
Date: 2024-04-01
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: Click the Connect button above, select the esm-mpnn-progen:1.0 image under the public third-party software images, and choose the c12_m46_1 * NVIDIA GPU B node configuration. Because the ESM model parameters need to be loaded, the notebook takes roughly 15 minutes to become ready.
AI4SCUP competition note: This tutorial is provided for participants' reference only; it introduces several commonly used tools and is intended to offer some inspiration for designing your own approach.
There is not yet a consistent, generally applicable algorithm for predicting enzyme function, and AI-based methods still have limited success rates and accuracy for enzyme design and function prediction. For the RhlA synthase, we can try existing enzyme design and scoring tools and, using the available experimental data, build a customized prediction model, aiming for higher accuracy and reliability in predicting enzyme function for this system and thereby guiding the design of high-performance mutant sequences.
Commonly used tools for enzyme structure prediction and design
1. Structure prediction tools
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
https://colab.research.google.com/github/dptech-corp/Uni-Fold/blob/main/notebooks/unifold.ipynb#scrollTo=jMGcXXPabEN4
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/RoseTTAFold2.ipynb
2. Protein sequence generation and scoring
https://github.com/salesforce/progen/tree/main/progen2
https://github.com/pagnani/ArDCA.jl/blob/master/python-notebook/arDCA_sklearn.ipynb
3. Protein design and scoring
https://github.com/dauparas/LigandMPNN/tree/main
https://github.com/facebookresearch/esm/tree/2b369911bb5b4b0dda914521b9475cad1656b2ac/examples/inverse_folding
4. Protein sequence feature embeddings
https://github.com/facebookresearch/esm
The sections below briefly introduce how three tools, LigandMPNN, ESM, and Progen2, can be applied to protein/enzyme sequence design and function prediction; for detailed tutorials, please consult the original code repositories.
Using LigandMPNN
LigandMPNN is a deep learning based protein sequence design method. Taking an enzyme-small molecule complex structure as input, scoring with score.py yields the probabilities of the 20 amino acid types at every position.
Create the LigandMPNN environment
Obtain the complex structure and sequence files
Download the RhlA structure file from the PDB website, PDB ID: 8IK2
Sequence rcsb_pdb_8IK2.fasta:
MRRESLLVSVCKGLRVHVERVGQDPGRSTVMLVNGAMATTASFARTCKCLAEHFNVVLFDLPFAGQSRQHNPQRGLITKDDEVEILLALIERFEVNHLVSASWGGISTLLALSRNPRGIRSSVVMAFAPGLNQAMLDYVGRAQALIELDDKSAIGHLLNETVGKYLPQRLKASNHQHMASLATGEYEQARFHIDQVLALNDRGYLACLERIQSHVHFINGSWDEYTTAEDARQFRDYLPHCSFSRVEGTGHFLDLESKLAAVRVHRALLEHLLKQPEPQRAERAAGFHEMAIGYAHHHHHH
Changed to the following, to stay consistent with the PDB structure file: RRESLLVSVCKGLRVHVERVGQDPGRSTVMLVNGAMATTASFARTCKCLAEHFNVVLFDLPFAGQSRQHNPGLITKDDEVEILLALIERFEVNHLVSASWGGISTLLALSRNPRGIRSSVVMAFAPGLNQAMLDYVGRAQALIELDDKSAIGHLLNETVGKYLPQRLKASNHQHMASLATGEYEQARFHIDQVLALNDRGYLACLERIQSHVHFINGSWDEYTTAEDARQFRDYLPHCSFSRVEGTGHFLDLESKLAAVRVHRALLEHLL
In this competition, you can first use tools such as Uni-Mol to predict the protein (RhlA)-small molecule complex structure, and then feed the complexes with different selectivities into LigandMPNN to obtain the corresponding scores.
Inspecting the structure shows that residues 73-74 are missing from the crystal structure; tools such as Uni-Fold or AlphaFold2 can be used to model the missing region.
LigandMPNN scoring
We recommend the single amino acid score with sequence info; the autoregressive score depends on the decoding order and may be less reliable. Feel free to explore the other modes on your own. A sketch of a scoring call is shown below.
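The following is a minimal sketch of such a call, wrapped in a Python subprocess invocation so that it can be run from a notebook cell. The flag names follow the scoring examples in the LigandMPNN README at the time of writing, and the input/output paths are placeholders for this tutorial; verify both against your own checkout of score.py.

```python
import subprocess

# Single-amino-acid scoring with sequence context, as recommended above.
# Flag names follow the LigandMPNN README; paths are placeholders for this tutorial.
subprocess.run(
    [
        "python", "score.py",
        "--model_type", "ligand_mpnn",
        "--seed", "111",
        "--single_aa_score", "1",     # score one masked position at a time
        "--use_sequence", "1",        # condition on the native sequence context
        "--pdb_path", "./inputs/8ik2.pdb",
        "--out_folder", "./outputs/8ik2_single_aa_score",
        "--batch_size", "10",
        "--number_of_batches", "10",
    ],
    cwd="/personal/soft/LigandMPNN",
    check=True,
)
```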
/personal/soft/LigandMPNN
Designing protein from this path: ./inputs/8ik2.pdb
These residues will be redesigned: ['A2', 'A3', ..., 'A72', 'A75', ..., 'A273'] (all resolved chain A residues; A73-A74 are absent from the crystal structure)
These residues will be fixed: []
The number of ligand atoms parsed is equal to: 13
Type: C, Coords [ 4.687 2.347 -12.914], Mask 1
...
Type: C, Coords [ 3.586 7.277 -5.902], Mask 1
Read the score matrix
R19V:0.49 N33H:0.54 T45L:0.73 L60W:0.65 H69L:0.47 D78E:0.43 F90Y:0.44 A98H:0.75 S118R:0.46 V136L:0.75 G137D:0.44 D147R:0.51 M175F:0.69 T180P:0.4 Q185N:0.75 R187L:0.44 H189Y:0.43 A225P:0.44 L235I:0.8 H237N:0.53 D251W:0.44 Results saved to ./outputs/8ik2_single_aa_score/8ik2_single_aa_score.csv
Example results
From the score heatmap, mutations such as R2V, L6V, and K11N receive favorable scores (shown here for illustration only).
ESM usage example
ESM (Evolutionary Scale Modeling) applies deep learning to predict protein structure and function. It treats protein sequences as a special language in which each amino acid corresponds to a letter, and uses a Transformer model to capture the statistical properties of that language. By training the Transformer on a very large collection of protein sequences, ESM learns the evolutionary patterns of protein sequences and the complex relationships between sequence, structure, and function. Training uses a masked language modeling strategy: some amino acids in a sequence are randomly hidden and the model must predict them, which forces it to learn long-range dependencies and contextual information. After training, ESM outputs high-dimensional vectors that encode structural and functional information about a protein; these serve as sequence features for downstream prediction and analysis tasks.
Environment setup
Download the code repository with git clone, then set up the conda environment in one step with conda env create -f environment.yml. This environment has already been configured in the image and can be used directly.
Extract pretrained representations
Extract protein sequence representations from the installed pretrained ESM-2 model. Here we load the esm2_t33_650M_UR50D model and visualize the self-attention contact map. These representations can be used for a variety of downstream bioinformatics tasks, such as sequence alignment, structure prediction, or functional annotation. A sketch is given below.
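A minimal sketch following the usage example in the facebookresearch/esm README; the short RhlA fragment used here is only a placeholder, and in practice you would pass the full sequence.

```python
import torch
import esm

# Load the pretrained ESM-2 model (weights are downloaded on first use)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic results

# Placeholder input: replace with the full RhlA sequence
data = [("RhlA_fragment", "MRRESLLVSVCKGLRVHVERVGQDPGRSTVMLVNGAMATTASFARTCKCL")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Per-residue representations from the last (33rd) layer and the predicted contact map
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]  # (batch, length+2, 1280)
contacts = results["contacts"]                           # self-attention contact map

# Mean-pool over the real residues (skip BOS/EOS tokens) to get one 1280-dim embedding
sequence_embedding = token_representations[0, 1:len(data[0][1]) + 1].mean(0)
print(sequence_embedding.shape)  # torch.Size([1280])
```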
Introduction to ESM-1v
ESM-1v is a general protein language model trained on the UR90 dataset. Its distinguishing feature is that it can predict the likely effect of amino acid mutations on protein function without any additional data, so it may be useful for predicting the functional effects of enzyme mutations.
This example demonstrates how to use the esm-1v model to predict the effect of protein variants on biological activity. β-lactamase is used as the example; the mutation sites and experimentally measured effect values are provided.
Download the required model parameters (already downloaded in this image), then extract an embedding for every variant sequence; a sketch of the extraction call follows.
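The log below is produced by the extract.py script from the esm repository, which writes one .pt file of representations per FASTA record. A sketch of the call, wrapped in subprocess so it can run from a notebook cell; the repository path and output directory are placeholders.

```python
import subprocess

# Extract mean (sequence-level) layer-33 representations for all 5397 β-lactamase variants.
# Positional arguments and flags follow scripts/extract.py in the esm repository;
# the repo path and output directory below are placeholders.
subprocess.run(
    [
        "python", "scripts/extract.py",
        "esm1v_t33_650M_UR90S_1",         # pretrained ESM-1v checkpoint
        "examples/data/P62593.fasta",     # variant sequences
        "examples/data/P62593_reprs",     # one .pt file per sequence is written here
        "--repr_layers", "33",
        "--include", "mean",
    ],
    cwd="/personal/soft/esm",             # placeholder path to the cloned esm repo
    check=True,
)
```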
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt" to /root/.cache/torch/hub/checkpoints/esm1v_t33_650M_UR90S_1.pt
UserWarning: Regression weights not found, predicting contacts will not produce correct results.
Transferred model to GPU
Read esm/examples/data/P62593.fasta with 5397 sequences
Processing 1 of 386 batches (14 sequences)
...
Processing 386 of 386 batches (7 sequences)
Load the embeddings (Xs) and the corresponding target effect values (ys), and split them into training and test sets; a sketch follows.
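A sketch of the loading step, assuming the per-sequence .pt files written by extract.py above and a CSV of measured variant effects; the CSV path and its column names (mutant, log_fitness) are assumptions for illustration and must be adapted to the actual data file.

```python
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# Assumed paths and column names; adapt to the β-lactamase deep mutational scanning data
EMB_DIR = "examples/data/P62593_reprs"
df = pd.read_csv("examples/data/P62593.csv")  # assumed columns: mutant, log_fitness

Xs, ys = [], []
for _, row in df.iterrows():
    rep = torch.load(f"{EMB_DIR}/{row['mutant']}.pt")
    Xs.append(rep["mean_representations"][33].numpy())  # 1280-dim mean embedding
    ys.append(row["log_fitness"])
Xs = np.stack(Xs)
ys = np.array(ys)
print(len(ys), Xs.shape)  # 5397 (5397, 1280)

# 80/20 train/test split
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, test_size=0.2, random_state=42)
print(Xs_train.shape, Xs_test.shape, len(ys_train), len(ys_test))
```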
5397 (5397, 1280)
((4317, 1280), (1080, 1280), 4317, 1080)
Use PCA to reduce the 1280-dimensional features to 60 dimensions, for example:
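A short sketch of the PCA step, fitted on the training embeddings from the split above.

```python
from sklearn.decomposition import PCA

# Fit PCA on the training embeddings and keep the first 60 components
pca = PCA(n_components=60)
Xs_train_pca = pca.fit_transform(Xs_train)
Xs_test_pca = pca.transform(Xs_test)  # apply the same projection to the test set
print(Xs_train_pca.shape)             # (4317, 60)
```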
(4317, 60)
Plot the first two principal components after dimensionality reduction, coloring each point by the magnitude of its mutation effect (see the sketch below). Visually, the ESM representation does separate different mutation effects to some extent, but the separation is not clear-cut.
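A minimal plotting sketch using the PCA output from above.

```python
import matplotlib.pyplot as plt

# Scatter of the first two principal components, colored by the measured mutation effect
plt.scatter(Xs_train_pca[:, 0], Xs_train_pca[:, 1], c=ys_train, s=5, cmap="viridis")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.colorbar(label="mutation effect")
plt.show()
```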
Next, use scikit-learn's grid search to optimize the model hyperparameters. Three different regression models are defined, PCA is used for dimensionality reduction inside the pipeline, and the models are fitted on the training data Xs_train and ys_train; a sketch follows.
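A sketch of the pipeline and grid search setup. The parameter grids shown here are illustrative and smaller than those used to produce the tables below; the pipeline step names (pca, model) are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Illustrative (reduced) parameter grids for the three regressors
param_grids = {
    "KNeighborsRegressor": (KNeighborsRegressor(), {
        "model__n_neighbors": [5, 10],
        "model__weights": ["uniform", "distance"],
        "model__p": [1, 2],
    }),
    "SVR": (SVR(), {
        "model__C": [0.1, 1.0, 10.0],
        "model__kernel": ["linear", "poly", "rbf", "sigmoid"],
    }),
    "RandomForestRegressor": (RandomForestRegressor(n_estimators=20), {
        "model__max_features": ["sqrt", "log2"],
        "model__min_samples_leaf": [1, 4],
        "model__min_samples_split": [5, 10],
    }),
}

results = {}
for name, (estimator, grid) in param_grids.items():
    pipe = Pipeline([("pca", PCA(n_components=60)), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1, verbose=1)
    search.fit(Xs_train, ys_train)
    results[name] = search
    print(name, search.best_estimator_.named_steps["model"])
```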
<class 'sklearn.neighbors._regression.KNeighborsRegressor'> Fitting 5 folds for each of 48 candidates, totalling 240 fits <class 'sklearn.svm._classes.SVR'> Fitting 5 folds for each of 12 candidates, totalling 60 fits <class 'sklearn.ensemble._forest.RandomForestRegressor'> Fitting 5 folds for each of 16 candidates, totalling 80 fits
KNeighborsRegressor(algorithm='brute', leaf_size=15, p=1, weights='distance')
Print the top five parameter settings for each model, ranked by the cross-validation score mean_test_score.
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model | param_model__algorithm | param_model__leaf_size | param_model__n_neighbors | param_model__p | param_model__weights | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
33 | 0.464866 | 0.177969 | 0.321298 | 0.108862 | KNeighborsRegressor(algorithm='brute', leaf_si... | brute | 15 | 5 | 1 | distance | {'model': KNeighborsRegressor(algorithm='brute... | 0.698647 | 0.697457 | 0.664584 | 0.653614 | 0.661470 | 0.675154 | 0.019038 | 1 |
1 | 0.433704 | 0.030075 | 0.500921 | 0.148813 | KNeighborsRegressor(algorithm='brute', leaf_si... | ball_tree | 15 | 5 | 1 | distance | {'model': KNeighborsRegressor(algorithm='brute... | 0.690671 | 0.704571 | 0.668078 | 0.649002 | 0.661473 | 0.674759 | 0.020132 | 2 |
41 | 0.400773 | 0.014872 | 0.269459 | 0.013537 | KNeighborsRegressor(algorithm='brute', leaf_si... | brute | 30 | 5 | 1 | distance | {'model': KNeighborsRegressor(algorithm='brute... | 0.696442 | 0.704792 | 0.656956 | 0.652790 | 0.662252 | 0.674646 | 0.021578 | 3 |
9 | 0.359792 | 0.047904 | 0.415002 | 0.016745 | KNeighborsRegressor(algorithm='brute', leaf_si... | ball_tree | 30 | 5 | 1 | distance | {'model': KNeighborsRegressor(algorithm='brute... | 0.693818 | 0.699375 | 0.665714 | 0.646509 | 0.656008 | 0.672285 | 0.020833 | 4 |
17 | 0.384339 | 0.028234 | 0.732669 | 0.047760 | KNeighborsRegressor(algorithm='brute', leaf_si... | kd_tree | 15 | 5 | 1 | distance | {'model': KNeighborsRegressor(algorithm='brute... | 0.695070 | 0.700020 | 0.660117 | 0.643285 | 0.659642 | 0.671627 | 0.022069 | 5 |
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model | param_model__C | param_model__degree | param_model__gamma | param_model__kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 1.408159 | 0.030212 | 0.259344 | 0.007544 | SVR() | 1.0 | 3 | scale | rbf | {'model': SVR(), 'model__C': 1.0, 'model__degr... | 0.738775 | 0.726874 | 0.677521 | 0.704064 | 0.689964 | 0.707439 | 0.022678 | 1 |
10 | 1.968262 | 0.093380 | 0.250730 | 0.042791 | SVR() | 10.0 | 3 | scale | rbf | {'model': SVR(), 'model__C': 10.0, 'model__deg... | 0.700471 | 0.714393 | 0.671048 | 0.694977 | 0.678562 | 0.691890 | 0.015502 | 2 |
2 | 1.709086 | 0.836304 | 0.370426 | 0.166561 | SVR() | 0.1 | 3 | scale | rbf | {'model': SVR(), 'model__C': 0.1, 'model__degr... | 0.643136 | 0.621495 | 0.575418 | 0.609085 | 0.581199 | 0.606067 | 0.025215 | 3 |
5 | 2.218433 | 0.444332 | 0.145457 | 0.010865 | SVR() | 1.0 | 3 | scale | poly | {'model': SVR(), 'model__C': 1.0, 'model__degr... | 0.523625 | 0.424193 | 0.454011 | 0.477616 | 0.421954 | 0.460280 | 0.037745 | 4 |
8 | 1.463197 | 0.323881 | 0.164261 | 0.062616 | SVR() | 10.0 | 3 | scale | linear | {'model': SVR(), 'model__C': 10.0, 'model__deg... | 0.483942 | 0.447307 | 0.421951 | 0.456868 | 0.439885 | 0.449991 | 0.020472 | 5 |
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model | param_model__criterion | param_model__max_features | param_model__min_samples_leaf | param_model__min_samples_split | param_model__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.985876 | 0.019441 | 0.013288 | 0.001239 | RandomForestRegressor(max_features='sqrt', min... | squared_error | sqrt | 1 | 5 | 20 | {'model': RandomForestRegressor(max_features='... | 0.536794 | 0.529643 | 0.496419 | 0.526319 | 0.483046 | 0.514444 | 0.020892 | 1 |
2 | 0.900409 | 0.034486 | 0.012528 | 0.001749 | RandomForestRegressor(max_features='sqrt', min... | squared_error | sqrt | 4 | 5 | 20 | {'model': RandomForestRegressor(max_features='... | 0.536388 | 0.510755 | 0.478170 | 0.525115 | 0.467117 | 0.503509 | 0.026709 | 2 |
4 | 0.834338 | 0.041532 | 0.014159 | 0.002465 | RandomForestRegressor(max_features='sqrt', min... | squared_error | log2 | 1 | 5 | 20 | {'model': RandomForestRegressor(max_features='... | 0.514058 | 0.516165 | 0.477799 | 0.523440 | 0.468673 | 0.500027 | 0.022283 | 3 |
1 | 0.965925 | 0.032833 | 0.013376 | 0.001395 | RandomForestRegressor(max_features='sqrt', min... | squared_error | sqrt | 1 | 10 | 20 | {'model': RandomForestRegressor(max_features='... | 0.521575 | 0.498732 | 0.462031 | 0.525669 | 0.490861 | 0.499774 | 0.023026 | 4 |
5 | 0.760986 | 0.091446 | 0.012131 | 0.001436 | RandomForestRegressor(max_features='sqrt', min... | squared_error | log2 | 1 | 10 | 20 | {'model': RandomForestRegressor(max_features='... | 0.527899 | 0.500122 | 0.461590 | 0.498032 | 0.477403 | 0.493009 | 0.022467 | 5 |
Plot learning curves to see how model performance changes with the amount of training data; a sketch follows.
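A sketch using scikit-learn's learning_curve on the best SVR pipeline found above (the results dict comes from the grid search sketch).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Learning curve for the best SVR pipeline found by the grid search above
train_sizes, train_scores, val_scores = learning_curve(
    results["SVR"].best_estimator_, Xs_train, ys_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, n_jobs=-1,
)
plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("number of training examples")
plt.ylabel("R² score (default regressor scoring)")
plt.legend()
plt.show()
```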
For each of the three methods, plot the predicted versus measured values as a scatter plot to compare their performance visually, and use best_estimator_ to compute the Spearman correlation between the predicted and measured mutation effects on the held-out test set; for example:
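A sketch of the Spearman evaluation on the held-out split, reusing the results dict and test arrays defined above.

```python
from scipy.stats import spearmanr

# Spearman correlation between predicted and measured effects on the held-out test set
for name, search in results.items():
    preds = search.best_estimator_.predict(Xs_test)
    print(name, spearmanr(ys_test, preds))
    print("-" * 80)
```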
KNeighborsRegressor SpearmanrResult(correlation=0.8000028700947366, pvalue=2.119700151311828e-241) -------------------------------------------------------------------------------- SVR SpearmanrResult(correlation=0.8134720990350951, pvalue=5.536753308261259e-256) -------------------------------------------------------------------------------- RandomForestRegressor SpearmanrResult(correlation=0.7206324349238818, pvalue=1.1276858229812068e-173) --------------------------------------------------------------------------------
According to these results, the SVR performs best on the test set, with a Spearman rho of about 0.81, which shows that pretrained ESM embeddings can deliver good performance on this downstream prediction task.
For more effective zero-shot mutation prediction methods, see the corresponding examples/variant-prediction folder in the ESM repository.
For RhlA mutants and their measured activity and selectivity, we can likewise try using the ESM-1v model to generalize from the data. However, sequence features alone are not sufficient; structural features should also be considered in order to probe how mutations affect the catalytic geometry and spatial arrangement of the two substrates within the enzyme.
Using Progen2
Progen2 is a protein sequence generation language model. It can generate protein sequences with desired properties and can also be used to evaluate the fitness of a protein sequence.
Installing Progen2 is straightforward and it has already been integrated into this environment; once the model parameters are downloaded, it can be used directly.
/personal/soft/progen/progen2
loading parameters loading parameters took 58.58s loading tokenizer loading tokenizer took 0.04s log-likelihood (left-to-right, right-to-left) ll_sum=-595.1453857421875 ll_mean=-2.208328366279602 log-likelihood (left-to-right, right-to-left) took 0.90s done.
The ll_mean score returned by Progen2 estimates how likely the sequence is in sequence space, i.e. the mean log-probability per token, and can be used as a fitness score for the sequence. sample.py can be used for generative protein sequence design. A sketch of a likelihood scoring call is given below.
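The log above comes from the likelihood scoring script in the progen2 repository. The sketch below shows such a call wrapped in subprocess; the script name, flags, checkpoint name, and sequence are assumptions based on the progen2 README and should be verified against the repository before use.

```python
import subprocess

# Score a candidate RhlA sequence with a ProGen2 checkpoint (higher ll_mean = more "natural").
# Script name, flags, and checkpoint name are assumptions; check the progen2 README.
candidate = "MRRESLLVSVCKGLRVHVERVGQDPGRSTVMLVNGAMATTASFARTCKCL"  # placeholder fragment
subprocess.run(
    [
        "python", "likelihood.py",
        "--model", "progen2-base",  # one of the released ProGen2 checkpoints
        "--context", candidate,     # sequence to score
    ],
    cwd="/personal/soft/progen/progen2",
    check=True,
)
```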