HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative

Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei Xiang, Xiaonan Zhang, Hua Wu, Hui Li, Le Song
DOI: https://doi.org/10.1038/s42256-023-00721-6
2023-02-22
Abstract: AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on Multiple Sequence Alignments (MSAs) as inputs to learn the co-evolution information from the homologous sequences. Nonetheless, searching MSAs from protein databases is time-consuming, usually taking dozens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction by using only primary sequences of proteins. HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2. Our proposed method, HelixFold-Single, first pre-trains a large-scale protein language model (PLM) with thousands of millions of primary sequences utilizing the self-supervised learning paradigm, which will be used as an alternative to MSAs for learning the co-evolution information. Then, by combining the pre-trained PLM and the essential components of AlphaFold2, we obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving competitive accuracy with the MSA-based methods on the targets with large homologous families. Furthermore, HelixFold-Single consumes much less time than the mainstream pipelines for protein structure prediction, demonstrating its potential in tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
Biomolecules, Artificial Intelligence, Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper aims to make protein structure prediction fast while keeping it accurate, by removing the dependence on Multiple Sequence Alignments (MSAs). Existing AI-based structure prediction methods, such as AlphaFold2, reach accuracy close to that of experimental methods, but they rely mainly on MSAs to learn the co-evolutionary information of homologous sequences. Searching protein databases for MSAs is time-consuming, usually taking dozens of minutes, which is impractical for high-throughput tasks such as protein design. The paper therefore proposes HelixFold-Single, which predicts structure from only the primary sequence of the protein (i.e., the amino acid sequence) by combining a large-scale Protein Language Model (PLM) with the geometric learning ability of AlphaFold2. The method takes much less time than MSA-based pipelines and achieves competitive accuracy on targets with large homologous families.
The main contributions of the paper are:
1. **Reducing the dependence on MSAs**: A large-scale PLM replaces MSAs as the source of co-evolutionary information, so no MSA search is needed at prediction time.
2. **Increasing the prediction speed**: Compared with MSA-dependent methods, HelixFold-Single is significantly faster, especially on short protein chains.
3. **Maintaining the prediction accuracy**: Even without MSAs, HelixFold-Single keeps relatively high prediction accuracy, especially for targets with rich homologous sequences.
These improvements give HelixFold-Single broad application potential in tasks requiring a large number of structure predictions, such as drug design and vaccine development.
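The abstract says only that the PLM is pre-trained on primary sequences with a self-supervised paradigm. As a minimal sketch of what such an objective can look like, the snippet below assumes a BERT-style masked-residue task, in which some residues are hidden and the model is trained to recover them from context. The mask symbol, the 15% mask rate, the example sequence, and the function names are all hypothetical choices for illustration, not details taken from the paper.

```python
import random

MASK = "#"  # hypothetical mask symbol, not part of the amino acid alphabet


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a fraction of residues; return (masked_seq, targets).

    targets maps each masked position to the original residue, i.e. the
    labels a language model would be trained to predict from context.
    The 15% rate is an assumed, BERT-like default.
    """
    rng = random.Random(seed)  # local RNG so the masking is reproducible
    n_mask = min(len(seq), max(1, round(len(seq) * mask_rate)))
    positions = rng.sample(range(len(seq)), n_mask)
    residues = list(seq)
    targets = {}
    for i in positions:
        targets[i] = residues[i]
        residues[i] = MASK
    return "".join(residues), targets


def restore(masked_seq, targets):
    """Fill masked positions back in (what a perfect model would output)."""
    residues = list(masked_seq)
    for i, residue in targets.items():
        residues[i] = residue
    return "".join(residues)


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence
    masked, targets = mask_sequence(example)
    print(masked)
    print(targets)
```

No alignment or database search appears anywhere in this setup: the training signal comes from the sequence itself, which is what lets a PLM stand in for the MSA search step at prediction time.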