Protein Folding with ESMFold and 🤗transformers
Last modified: dingzh@dp.tech
Description: This tutorial is adapted from the Hugging Face notebook and can be run directly on Bohrium Notebook. Click the blue "Connect" button at the top of the interface, select the bohrium-notebook:2023-04-07 image and any GPU node configuration, and it will be ready to run after a short wait. If you run into any problems, please contact bohrium@dp.tech.
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
ESMFold (paper link) is a recently released protein folding model from FAIR. Unlike other protein folding models, it does not require external databases or search tools to predict structures, and is up to 60X faster as a result.
The port to the HuggingFace `transformers` library is even easier to use, as we've removed the dependency on tools like `openfold`: once you `pip install transformers`, you're ready to use this model!
Note that all the code that follows will be running the model locally, rather than calling an external API. This means that no rate limiting applies here - you can predict as many structures as your computer can handle.
In testing, we found that ESMFold needs about 16-24GB of GPU memory to run well, depending on protein length. This may be too much for the smaller free GPUs on Colab.
First step, make sure you're up to date - you'll need the most recent releases of `transformers` and `accelerate`! If you want to visualize your predicted protein structure in the notebook, you should also install `py3Dmol`.
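In a notebook environment, an install cell along these lines should do it (a sketch; pin versions if you need reproducibility):

```python
# Upgrade to the latest transformers and accelerate, plus py3Dmol for visualization
! pip install --upgrade transformers accelerate py3Dmol
```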
We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.
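The telemetry cell looks roughly like this (the example tag is taken from the original notebook; delete this cell to opt out):

```python
from transformers.utils import send_example_telemetry

# Reports which example notebook and framework are in use - no personal data
send_example_telemetry("protein_folding_notebook", framework="pytorch")
```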
Preparing your model and tokenizer
Now we load our model and tokenizer. If you're using a GPU, call `model.cuda()` to transfer the model to it.
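A sketch of the loading step, using the `facebook/esmfold_v1` checkpoint from the original release:

```python
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

model = model.cuda()  # skip this line if you're running on CPU
```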
Performance optimizations
Since ESMFold is quite a large model, there are some considerations regarding memory usage and performance.
Firstly, we can optionally convert the language model stem to float16 to improve performance and memory usage when running on a modern GPU. This was used during model training, and so should not make the outputs from the rest of the model invalid.
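Assuming the `model` object loaded above, the conversion is a one-liner:

```python
# Convert the language model stem to float16; the folding trunk stays in float32
model.esm = model.esm.half()
```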
Secondly, you can enable TensorFloat32 computation for a general speedup if your hardware supports it. This line has no effect if your hardware doesn't support it.
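```python
import torch

# Allow TensorFloat32 matmuls on Ampere-or-newer GPUs; a no-op on other hardware
torch.backends.cuda.matmul.allow_tf32 = True
```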
Finally, we can reduce the `chunk_size` used in the folding trunk. Smaller chunk sizes use less memory, but have slightly worse performance.
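For example (64 is a reasonable middle ground; lower it if you hit out-of-memory errors):

```python
# Trade speed for memory in the folding trunk's attention computation
model.trunk.set_chunk_size(64)
```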
Folding a single chain
First, we tokenize our input. If you've used `transformers` before, proteins are processed like any other input string. Make sure not to add special tokens - ESM was trained with them, but ESMFold was trained without them.
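A sketch of the tokenization step. The sequence here is a toy placeholder (the original notebook folds human GNAT1); substitute your own single-letter amino-acid string:

```python
# Toy placeholder sequence - replace with a real protein of interest
test_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"

tokenized_input = tokenizer([test_protein], return_tensors="pt", add_special_tokens=False)["input_ids"]
```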
If you're using a GPU, you'll need to move the tokenized data to the GPU now.
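```python
tokenized_input = tokenized_input.cuda()
```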
With our preparations out of the way, getting your model outputs is as simple as...
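```python
import torch

# Inference only, so we disable gradient tracking to save memory
with torch.no_grad():
    output = model(tokenized_input)
```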
Now here's the tricky bit - we convert the model outputs to a PDB file. This will likely be moved to a function in `transformers` in the future, but everything's still quite new, so it lives here for now! This code comes from the original ESMFold repo, and uses some functions from `openfold` that have been ported to `transformers`.
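The conversion function, as it appears in the original notebook (lightly commented here); it relies on the `openfold` utilities that were ported into `transformers`:

```python
from transformers.models.esm.openfold_utils.protein import to_pdb, Protein as OFProtein
from transformers.models.esm.openfold_utils.feats import atom14_to_atom37

def convert_outputs_to_pdb(outputs):
    # Expand the model's compact atom14 coordinates to the full atom37 representation
    final_atom_positions = atom14_to_atom37(outputs["positions"][-1], outputs)
    outputs = {k: v.to("cpu").numpy() for k, v in outputs.items()}
    final_atom_positions = final_atom_positions.cpu().numpy()
    final_atom_mask = outputs["atom37_atom_exists"]
    pdbs = []
    for i in range(outputs["aatype"].shape[0]):
        aa = outputs["aatype"][i]
        pred_pos = final_atom_positions[i]
        mask = final_atom_mask[i]
        resid = outputs["residue_index"][i] + 1
        # Store per-atom pLDDT in the B-factor column so viewers can colour by confidence
        pred = OFProtein(
            aatype=aa,
            atom_positions=pred_pos,
            atom_mask=mask,
            residue_index=resid,
            b_factors=outputs["plddt"][i],
            chain_index=outputs["chain_index"][i] if "chain_index" in outputs else None,
        )
        pdbs.append(to_pdb(pred))
    return pdbs

pdb = convert_outputs_to_pdb(output)
```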
Now we have our `pdb` string - can we visualize it?
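A minimal `py3Dmol` sketch, assuming the `pdb` list from the conversion step above:

```python
import py3Dmol

view = py3Dmol.view(width=600, height=400)
view.addModel(pdb[0], "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})  # rainbow colouring along the chain
view.zoomTo()
view.show()
```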
Looks good! We can colour it differently, though - our model outputs a `plddt` field containing per-atom confidence scores, indicating how confident it is in each part of the structure. In the conversion function above we passed the `plddt` field as the `b_factors` argument, so it was included in our PDB string. Let's use it so that we can see high- and low-confidence areas of the structure visually!
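A sketch of confidence colouring; the gradient bounds are a judgment call (pLDDT below ~50 is generally unreliable, above ~90 is very confident):

```python
view = py3Dmol.view(width=600, height=400)
view.addModel(pdb[0], "pdb")
# Colour by the B-factor column, where we stored pLDDT: red = low, blue = high confidence
view.setStyle({"cartoon": {"colorscheme": {"prop": "b", "gradient": "roygb", "min": 50, "max": 90}}})
view.zoomTo()
view.show()
```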
Blue indicates high confidence, so that's a pretty high-quality prediction! Not too surprising considering GNAT1 was almost certainly in the training data, but nevertheless good to see. Finally, we can write our PDB string out to a file, which you can download and use in other tools.
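```python
# Write the prediction out; the filename is arbitrary
with open("output_structure.pdb", "w") as f:
    f.write("".join(pdb))
```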
If you're running this in Colab (and haven't run out of memory by now!) then you can download the file we just created using the file browser interface at the left - the button looks like a little folder icon.
Folding multiple chains
Many proteins exist as complexes, either as multiple copies of the same peptide (a homopolymer), or a complex of different ones (a heteropolymer). To generate folds for such structures in ESMFold, we use a trick from the paper - we insert a "linker" of flexible glycine residues between each chain we want to fold simultaneously, and then we offset the position IDs for each chain from each other, so that the model treats them as being very distant portions of the same long chain. This works quite well, so let's see it in action! We'll use Glucosamine-6-phosphate deaminase (Uniprot: Q9CMF4) from the paper as an example.
First, we define the sequence of the monomer, and the poly-G linker we want to use. Then we stick two copies of the monomer together with the linker in between.
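A sketch of that step; the sequence here is a placeholder standing in for the real Q9CMF4 monomer, and the 25-glycine linker length follows the original notebook:

```python
# Placeholder - substitute the actual Q9CMF4 monomer sequence from UniProt
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"

linker = "G" * 25  # flexible poly-glycine linker
homodimer_sequence = sequence + linker + sequence
```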
Now we tokenize the full homodimer sequence just like we did with the monomer sequence above.
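```python
tokenized_homodimer = tokenizer([homodimer_sequence], return_tensors="pt", add_special_tokens=False)
```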
Now here's the tricky bit - we need to tweak the inputs a bit so the model doesn't think this is just a single peptide. The way we do that is with the `position_ids` input, which tells the model the position of each amino acid in the input chain. By default, the model assumes that you've passed it one linear, contiguous chain - in other words, if you give it a peptide with 100 amino acids, it will assume the `position_ids` are just `[0, 1, ..., 98, 99]` unless you tell it otherwise.
We want to make very clear that the two subunits aren't connected, though, so let's add a large offset to the position IDs of the second chain. The original repo uses 512, so let's stick with that.
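Along these lines (the split point is the end of the first monomer plus the linker):

```python
import torch

position_ids = torch.arange(len(homodimer_sequence), dtype=torch.long)
# Offset everything after the first monomer so the model treats the chains as distant
position_ids[len(sequence) + len(linker):] += 512
```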
Now we're ready to predict! Let's add our `position_ids` to the tokenized inputs - but make sure to add a singleton batch dimension first, to match the other arrays in there! Once that's done, we can transfer that dict to the GPU and we're ready to get our folded structure.
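```python
tokenized_homodimer["position_ids"] = position_ids.unsqueeze(0)  # add batch dimension

# Move every input tensor to the GPU
tokenized_homodimer = {key: tensor.cuda() for key, tensor in tokenized_homodimer.items()}
```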
Now we compute predictions just like before.
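```python
with torch.no_grad():
    output = model(**tokenized_homodimer)
```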
Next, we need to remove the poly-G linker from the output, so we can display the structure as fully independent chains. To do that, we'll alter the `atom37_atom_exists` field in the output. This field indicates, for display purposes, which atoms are present at each residue position. We will simply set all of the atoms for each of the linker residues to 0.
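```python
# 1 for real residues, 0 for linker residues; shaped to broadcast over the atom axis
linker_mask = torch.tensor([1] * len(sequence) + [0] * len(linker) + [1] * len(sequence))[None, :, None]
output["atom37_atom_exists"] = output["atom37_atom_exists"] * linker_mask.to(output["atom37_atom_exists"].device)
```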
With those output tweaks done, now we can convert the output to PDB and view it as before.
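```python
pdb = convert_outputs_to_pdb(output)

view = py3Dmol.view(width=600, height=400)
view.addModel(pdb[0], "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()
```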
And there's our dimer structure! As in the first example, we can now write this structure out to a PDB file and use it in downstream tools:
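```python
with open("dimer_structure.pdb", "w") as f:
    f.write("".join(pdb))
```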
Tip: If you're trying to predict a multimeric structure and you're getting low-quality outputs, try varying the order of the chains (if it's a heteropolymer) or the length of the linker.
Bulk predictions
Predicting single structures is nice, but the great advantage of running ESMFold locally is that it's extremely fast while still producing highly accurate predictions. This makes it very suitable for proteomics work. Let's see that in action here - we're going to grab a set of monomeric proteins in E. coli from UniProt and fold all of them with high accuracy on a single GPU in a couple of minutes (depending on your GPU!)
We do this because we can, and to upset any crystallographer friends we may have. First, you may need to install `requests`, `tqdm` and `pandas` if you don't have them already, to handle the data we grab from UniProt.
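```python
! pip install requests pandas tqdm
```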
Next, let's prepare the URL for our Uniprot query.
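A sketch of building that URL by hand: we URL-encode the search query (explained in the next paragraph) and ask UniProt's REST stream endpoint for a TSV containing just the sequence column. The exact URL produced by UniProt's own "Generate URL for API" button may differ slightly.

```python
import requests
from urllib.parse import quote

# Monomeric, reviewed E. coli proteins between 128 and 512 residues long
query = "(taxonomy_id:83333) AND (reviewed:true) AND (length:[128 TO 512]) AND (cc_subunit:monomer)"
uniprot_url = f"https://rest.uniprot.org/uniprotkb/stream?query={quote(query)}&fields=sequence&format=tsv"

uniprot_request = requests.get(uniprot_url)
```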
This UniProt URL might seem mysterious, but it isn't! To get it, we searched for `(taxonomy_id:83333) AND (reviewed:true) AND (length:[128 TO 512]) AND (cc_subunit:monomer)` on UniProt to get all monomeric E. coli proteins of reasonable length, then selected 'Download', and set the format to TSV and the columns to `Sequence`.
Once that's done, selecting `Generate URL for API` gives you a URL you can pass to Requests. Alternatively, if you're not on Colab, you can just download the data through the web interface and open the file locally.
To get this data into Pandas, we use a `BytesIO` object, which Pandas will treat like a file. If you downloaded the data as a file, you can skip this bit and just pass the filepath directly to `read_csv`.
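```python
from io import BytesIO
import pandas as pd

bio = BytesIO(uniprot_request.content)
df = pd.read_csv(bio, sep="\t")  # a TSV with a single "Sequence" column
```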
If you have time, you could process this entire list, giving you folded structures for the entire monomeric proteome of E. Coli. For the sake of this demo, though, let's limit ourselves to 10:
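```python
df = df.iloc[:10]
```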
Now let's pull out the sequences and batch-tokenize all of them.
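As before, we skip special tokens and leave padding off, since we'll feed sequences one at a time:

```python
ecoli_tokenized = tokenizer(df["Sequence"].tolist(), padding=False, add_special_tokens=False)["input_ids"]
```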
Now we loop over our tokenized data, passing each sequence into our model:
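A sketch of the prediction loop, moving each output back to the CPU so GPU memory doesn't accumulate:

```python
from tqdm import tqdm

outputs = []
with torch.no_grad():
    for input_ids in tqdm(ecoli_tokenized):
        # Each entry is a plain list of token IDs; tensorize and add a batch dimension
        input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda").unsqueeze(0)
        output = model(input_ids)
        outputs.append({key: val.cpu() for key, val in output.items()})
```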
Now we have 10 model outputs, which we can convert to PDB in bulk. If you get an error here, make sure you've run the cell above that defines the `convert_outputs_to_pdb` function!
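```python
pdb_list = [convert_outputs_to_pdb(output) for output in outputs]
```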
Let's inspect one of them to see what we got.
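```python
view = py3Dmol.view(width=600, height=400)
view.addModel("".join(pdb_list[0]), "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()
```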
Looks good to me! Now we can save all of these to disk together.
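A simple sketch using the row index as a filename; if you also fetched accession IDs from UniProt, you could use those instead:

```python
for i, pdb in enumerate(pdb_list):
    with open(f"ESMFold_prediction_{i}.pdb", "w") as f:
        f.write("".join(pdb))
```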
If you're on Colab, you can download all of these to your local machine using the file interface on the left.