Protein Folding with ESMFold and 🤗transformers
Tags: Transformers
Published by Bohrium on 2023-06-13
Recommended image: bohrium-notebook:2023-04-07 (Basic Image)
Recommended machine type: c2_m4_cpu
Protein Folding with ESMFold and 🤗transformers
Preparing your model and tokenizer
Performance optimizations
Folding a single chain
Folding multiple chains
Bulk predictions

Protein Folding with ESMFold and 🤗transformers


Last modified by: dingzh@dp.tech

Description: This tutorial is adapted from the Hugging Face notebook and can be run directly in a Bohrium Notebook. Click the blue "Connect" button at the top of the page, choose the bohrium-notebook:2023-04-07 image and any GPU node configuration, and after a short wait you will be ready to run. If you run into any problems, please contact bohrium@dp.tech.

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


ESMFold (paper link) is a recently released protein folding model from FAIR. Unlike other protein folding models, it does not require external databases or search tools to predict structures, and is up to 60X faster as a result.

The port to the HuggingFace transformers library is even easier to use, as we've removed the dependency on tools like openfold - once you pip install transformers, you're ready to use this model!

Note that all the code that follows will be running the model locally, rather than calling an external API. This means that no rate limiting applies here - you can predict as many structures as your computer can handle.

In testing, we found that ESMFold needs about 16-24GB of GPU memory to run well, depending on protein length. This may be too much for the smaller free GPUs on Colab.
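
If you're not sure what your node provides, you can check the GPU and its total memory before loading the model. This is just a quick sketch using standard PyTorch CUDA utilities; the 16GB threshold is the rough lower bound quoted above, not a hard limit.

[ ]
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total memory: {total_gb:.1f} GB")
    if total_gb < 16:
        print("This GPU may be too small for ESMFold; see the chunk_size tips below.")
else:
    print("No CUDA GPU detected - ESMFold on CPU will be extremely slow.")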


First step, make sure you're up to date - you'll need the most recent release of transformers and accelerate! If you want to visualize your predicted protein structure in the notebook, you should also install py3Dmol.

[ ]
! pip install --upgrade transformers py3Dmol accelerate

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

[ ]
from transformers.utils import send_example_telemetry

send_example_telemetry("protein_folding_notebook", framework="pytorch")

Preparing your model and tokenizer


Now we load our model and tokenizer. If you're using a GPU, call model.cuda() to move the model onto it.

[ ]
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

model = model.cuda()
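
If you'd like the same cell to work whether or not a GPU is present (for a quick smoke test, say), a device-agnostic variant might look like the sketch below. The device-selection logic is our own assumption rather than part of the original notebook, and CPU inference will be very slow.

[ ]
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)
model = model.to(device)  # falls back to CPU when no GPU is available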

Performance optimizations


Since ESMFold is quite a large model, there are some considerations regarding memory usage and performance.

Firstly, we can optionally convert the language model stem to float16 to improve performance and memory usage when running on a modern GPU. This was used during model training, and so should not make the outputs from the rest of the model invalid.

[ ]
# Optional: switch the language model stem to float16 (comment this out to keep float32)
model.esm = model.esm.half()

Secondly, you can enable TensorFloat32 computation for a general speedup if your hardware supports it. This line has no effect if your hardware doesn't support it.

[ ]
import torch

torch.backends.cuda.matmul.allow_tf32 = True

Finally, we can reduce the 'chunk_size' used in the folding trunk. Smaller chunk sizes use less memory, but have slightly worse performance.

[ ]
# Optional: reduce the chunk size if your GPU memory is 16GB or less, or if you're folding longer (over 600 or so) sequences
model.trunk.set_chunk_size(64)
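
If you'd rather not hard-code the chunk size, you could pick it from the GPU's total memory. The thresholds in this sketch are rough assumptions based on the guidance above, not values recommended by the ESMFold authors.

[ ]
import torch

# Pick a chunk size from total GPU memory (assumed thresholds; adjust to taste)
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_gb <= 16:
    model.trunk.set_chunk_size(64)
elif total_gb <= 24:
    model.trunk.set_chunk_size(128)
# On larger cards the default (unchunked) trunk is usually fine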

Folding a single chain


First, we tokenize our input. If you've used transformers before, proteins are processed like any other input string. Make sure not to add special tokens - ESM was trained with them, but ESMFold was trained without them.

[ ]
# This is the sequence for human GNAT1, because I worked on it when
# I was a postdoc and so everyone else has to learn to appreciate it too.
# Feel free to substitute your own peptides of interest
# Depending on memory constraints you may wish to use shorter sequences.
test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"

tokenized_input = tokenizer([test_protein], return_tensors="pt", add_special_tokens=False)['input_ids']


If you're using a GPU, you'll need to move the tokenized data to the GPU now.

[ ]
tokenized_input = tokenized_input.cuda()

With our preparations out of the way, getting your model outputs is as simple as...

[ ]
import torch

with torch.no_grad():
    output = model(tokenized_input)

Now here's the tricky bit - we convert the model outputs to a PDB file. This will likely be moved to a function in transformers in the future, but everything's still quite new, so it lives here for now! This code comes from the original ESMFold repo, and uses some functions from openfold that have been ported to transformers.

[ ]
from transformers.models.esm.openfold_utils.protein import to_pdb, Protein as OFProtein
from transformers.models.esm.openfold_utils.feats import atom14_to_atom37

def convert_outputs_to_pdb(outputs):
    final_atom_positions = atom14_to_atom37(outputs["positions"][-1], outputs)
    outputs = {k: v.to("cpu").numpy() for k, v in outputs.items()}
    final_atom_positions = final_atom_positions.cpu().numpy()
    final_atom_mask = outputs["atom37_atom_exists"]
    pdbs = []
    for i in range(outputs["aatype"].shape[0]):
        aa = outputs["aatype"][i]
        pred_pos = final_atom_positions[i]
        mask = final_atom_mask[i]
        resid = outputs["residue_index"][i] + 1
        pred = OFProtein(
            aatype=aa,
            atom_positions=pred_pos,
            atom_mask=mask,
            residue_index=resid,
            b_factors=outputs["plddt"][i],
            chain_index=outputs["chain_index"][i] if "chain_index" in outputs else None,
        )
        pdbs.append(to_pdb(pred))
    return pdbs
[ ]
pdb = convert_outputs_to_pdb(output)

Now we have our pdb - can we visualize it?

[ ]
import py3Dmol

view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', width=800, height=400)
view.addModel("".join(pdb), 'pdb')
view.setStyle({'model': -1}, {"cartoon": {'color': 'spectrum'}})

Looks good! We can colour it differently, though - our model outputs a plddt field containing probabilities for each atom, indicating how confident it is in that part of the structure. In the conversion function above we added the plddt field in the b_factors argument, so it was included in our pdb string. Let's use it so that we can see high- and low-confidence areas of the structure visually!

[ ]
# The plddt field is scaled from 0-1 on earlier versions of ESMFold but will be updated
# to match AlphaFold's scale of 0-100 in future versions.
# We check here so that this code will work on either:

if torch.max(output['plddt']) <= 1.0:
    vmin = 0.5
    vmax = 0.95
else:
    vmin = 50
    vmax = 95

view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min': vmin,'max': vmax}}})

Blue indicates high confidence, so that's a pretty high-quality prediction! Not too surprising considering GNAT1 was almost certainly in the training data, but nevertheless good to see. Finally, we can write our PDB string out to a file, which you can download and use in other tools.

[ ]
with open("output_structure.pdb", "w") as f:
f.write("".join(pdb))

If you're running this in Colab (and haven't run out of memory by now!) then you can download the file we just created using the file browser interface at the left - the button looks like a little folder icon.
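
Before moving on, it can also be useful to have a single number summarizing the prediction's confidence alongside the coloured structure. The short sketch below averages the pLDDT values over the atoms that actually exist; it assumes output['plddt'] and output['atom37_atom_exists'] have matching shapes, as in the conversion function above.

[ ]
# Masked mean pLDDT over existing atoms (assumes matching tensor shapes)
mask = output['atom37_atom_exists'].float()
mean_plddt = (output['plddt'] * mask).sum() / mask.sum()
print(f"Mean pLDDT: {mean_plddt.item():.3f}")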


Folding multiple chains


Many proteins exist as complexes, either as multiple copies of the same peptide (a homopolymer), or a complex of different ones (a heteropolymer). To generate folds for such structures in ESMFold, we use a trick from the paper - we insert a "linker" of flexible glycine residues between each chain we want to fold simultaneously, and then we offset the position IDs for each chain from each other, so that the model treats them as being very distant portions of the same long chain. This works quite well, so let's see it in action! We'll use Glucosamine-6-phosphate deaminase (Uniprot: Q9CMF4) from the paper as an example.


First, we define the sequence of the monomer, and the poly-G linker we want to use. Then we stick two copies of the monomer together with the linker in between.

[ ]
sequence = "MRLIPLHNVDQVAKWSARYIVDRINQFQPTEARPFVLGLPTGGTPLKTYEALIELYKAGEVSFKHVVTFNMDEYVGLPKEHPESYHSFMYKNFFDHVDIQEKNINILNGNTEDHDAECQRYEEKIKSYGKIHLFMGGVGVDGHIAFNEPASSLSSRTRIKTLTEDTLIANSRFFDNDVNKVPKYALTIGVGTLLDAEEVMILVTGYNKAQALQAAVEGSINHLWTVTALQMHRRAIIVCDEPATQELKVKTVKYFTELEASAIRSVK"

linker = 'G' * 25

homodimer_sequence = sequence + linker + sequence

Now we tokenize the full homodimer sequence just like we did with the monomer sequence above.

[ ]
tokenized_homodimer = tokenizer([homodimer_sequence], return_tensors="pt", add_special_tokens=False)

Now here's the tricky bit - we need to tweak the inputs a bit so the model doesn't think this is just a single peptide. The way we do that is by using the position_ids input to the model. The position_ids input tells the model the position of each amino acid in the input chain. By default, the model assumes that you've passed it one linear, contiguous chain - in other words, if you give it a peptide with 100 amino acids, it will assume the position_ids are just [0, 1, ..., 98, 99] unless you tell it otherwise.

We want to make very clear that the two subunits aren't connected, though, so let's add a large offset to the position IDs of the second chain. The original repo uses 512, so let's stick with that.

[ ]
with torch.no_grad():
    position_ids = torch.arange(len(homodimer_sequence), dtype=torch.long)
    position_ids[len(sequence) + len(linker):] += 512
print(position_ids)

Now we're ready to predict! Let's add our position_ids to the tokenized inputs, but make sure to add a singleton batch dimension first to match the other arrays in there! Once that's done we can transfer that dict to the GPU and we're ready to get our folded structure.

[ ]
tokenized_homodimer['position_ids'] = position_ids.unsqueeze(0)

tokenized_homodimer = {key: tensor.cuda() for key, tensor in tokenized_homodimer.items()}

Now we compute predictions just like before.

[ ]
with torch.no_grad():
    output = model(**tokenized_homodimer)

Next, we need to remove the poly-G linker from the output, so we can display the structure as fully independent chains. To do that, we'll alter the atom37_atom_exists field in the output. This field indicates, for display purposes, which atoms are present at each residue position. We will simply set all of the atoms for each of the linker residues to 0.

[ ]
linker_mask = torch.tensor([1] * len(sequence) + [0] * len(linker) + [1] * len(sequence))[None, :, None]

output['atom37_atom_exists'] = output['atom37_atom_exists'] * linker_mask.to(output['atom37_atom_exists'].device)

With those output tweaks done, now we can convert the output to PDB and view it as before.

[ ]
pdb = convert_outputs_to_pdb(output)
[ ]
view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', width=800, height=400)
view.addModel("".join(pdb), 'pdb')

# The plddt field is scaled from 0-1 on earlier versions of ESMFold but will be updated
# to match AlphaFold's scale of 0-100 in future versions.
# We check here so that this code will work on either:

if torch.max(output['plddt']) <= 1.0:
    vmin = 0.5
    vmax = 0.95
else:
    vmin = 50
    vmax = 95

view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min': vmin,'max': vmax}}})

And there's our dimer structure! As in the first example, we can now write this structure out to a PDB file and use it in downstream tools:

[ ]
with open("output_structure.pdb", "w") as f:
f.write("".join(pdb))

Tip: If you're trying to predict a multimeric structure and you're getting low-quality outputs, try varying the order of the chains (if it's a heteropolymer) or the length of the linker.
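
To make that kind of experimentation easier, here's a small helper sketch (not from the original notebook) that assembles the linked sequence and offset position_ids for an arbitrary list of chains. The function name build_multimer_inputs, the default linker length, and the 512 offset are our assumptions; vary them as suggested in the tip above.

[ ]
import torch

def build_multimer_inputs(chains, linker_length=25, offset=512):
    """Join chains with poly-G linkers and offset each chain's position IDs (hypothetical helper)."""
    linker = "G" * linker_length
    full_sequence = linker.join(chains)
    position_ids = torch.arange(len(full_sequence), dtype=torch.long)
    boundary = 0
    for chain in chains[:-1]:
        boundary += len(chain) + linker_length
        position_ids[boundary:] += offset  # each later chain sits a further `offset` positions away
    return full_sequence, position_ids

# Example: rebuild the homodimer inputs from above, or reorder chains for a heteropolymer
full_sequence, position_ids = build_multimer_inputs([sequence, sequence], linker_length=25)
inputs = tokenizer([full_sequence], return_tensors="pt", add_special_tokens=False)
inputs["position_ids"] = position_ids.unsqueeze(0)
inputs = {key: tensor.cuda() for key, tensor in inputs.items()}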


Bulk predictions


Predicting single structures is nice, but the great advantage of running ESMFold locally is that it's extremely fast while still producing highly accurate predictions. This makes it very suitable for proteomics work. Let's see that in action here - we're going to grab a set of monomeric E. coli proteins from UniProt and fold all of them with high accuracy on a single GPU in a couple of minutes (depending on your GPU!).

We do this because we can, and to upset any crystallographer friends we may have. First, you may need to install requests, tqdm and pandas if you don't have them already, to handle the data we grab from UniProt.

[ ]
# Uncomment and run this cell to install
#! pip install requests pandas tqdm

Next, let's prepare the URL for our UniProt query.

[ ]
import requests

uniprot_url = "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Csequence&format=tsv&query=%28%28taxonomy_id%3A83333%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28length%3A%5B128%20TO%20512%5D%29%20AND%20%28cc_subunit%3Amonomer%29%29"

This UniProt URL might seem mysterious, but it isn't! To get it, we searched for (taxonomy_id:83333) AND (reviewed:true) AND (length:[128 TO 512]) AND (cc_subunit:monomer) on UniProt to get all monomeric E. coli proteins of reasonable length, then selected 'Download', and set the format to TSV and the columns to Sequence.

Once that's done, selecting Generate URL for API gives you a URL you can pass to requests. Alternatively, if you're not on Colab you can just download the data through the web interface and open the file locally.
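
If you'd rather build the query in code than paste a pre-encoded URL, something like the following sketch should produce an equivalent request. The parameter names follow UniProt's REST "stream" endpoint as used in the URL above; treat the exact field and query strings as assumptions and adjust them for your own searches.

[ ]
import requests

params = {
    "compressed": "true",
    "fields": "accession,sequence",
    "format": "tsv",
    "query": "(taxonomy_id:83333) AND (reviewed:true) AND (length:[128 TO 512]) AND (cc_subunit:monomer)",
}
# requests will URL-encode the parameters for us
uniprot_request = requests.get("https://rest.uniprot.org/uniprotkb/stream", params=params)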

[ ]
uniprot_request = requests.get(uniprot_url)

To get this data into Pandas, we use a BytesIO object, which Pandas will treat like a file. If you downloaded the data as a file you can skip this bit and just pass the filepath directly to read_csv.

[ ]
from io import BytesIO
import pandas

bio = BytesIO(uniprot_request.content)

df = pandas.read_csv(bio, compression='gzip', sep='\t')
df = df.dropna() # Drop any rows with missing values, just in case
df

If you have time, you could process this entire list, giving you folded structures for the entire monomeric proteome of E. coli. For the sake of this demo, though, let's limit ourselves to 10:

[ ]
df = df.iloc[:10]

Now let's pull out the sequences and batch-tokenize all of them.

[ ]
ecoli_tokenized = tokenizer(df.Sequence.tolist(), padding=False, add_special_tokens=False)['input_ids']

Now we loop over our tokenized data, passing each sequence into our model:

[ ]
from tqdm import tqdm

outputs = []

with torch.no_grad():
    for input_ids in tqdm(ecoli_tokenized):
        input_ids = torch.tensor(input_ids, device='cuda').unsqueeze(0)
        output = model(input_ids)
        outputs.append({key: val.cpu() for key, val in output.items()})

Now we have 10 model outputs, which we can convert to PDB in bulk. If you get an error here, make sure you've run the cell above that defines the convert_outputs_to_pdb function!

[ ]
pdb_list = [convert_outputs_to_pdb(output) for output in outputs]

Let's inspect one of them to see what we got.

[ ]
import py3Dmol

view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', width=800, height=400)
view.addModel("".join(pdb_list[0]), 'pdb')

# The plddt field is scaled from 0-1 on earlier versions of ESMFold but will be updated
# to match AlphaFold's scale of 0-100 in future versions.
# We check here so that this code will work on either:

if torch.max(output['plddt']) <= 1.0:
    vmin = 0.5
    vmax = 0.95
else:
    vmin = 50
    vmax = 95

view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min': vmin,'max': vmax}}})

Looks good to me! Now we can save all of these to disk together.

[ ]
protein_identifiers = df.Entry.tolist()
for identifier, pdb in zip(protein_identifiers, pdb_list):
    with open(f"{identifier}.pdb", "w") as f:
        f.write("".join(pdb))

If you're on Colab, you can download all of these to your local machine using the file interface on the left.
