Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction

Jun Zhang, Sirui Liu, Mengyun Chen, Haotian Chu, Min Wang, Zidong Wang, Jialiang Yu, Ningxi Ni, Fan Yu, Diqing Chen, Yi Isaac Yang, Boxin Xue, Lijiang Yang, Yuan Liu, Yi Qin Gao
DOI: https://doi.org/10.1021/acs.jctc.3c00528
2023-10-08
Abstract: Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and medical development. Determining an accurate folding landscape using co-evolutionary information is fundamental to the success of modern protein structure prediction methods. As the state of the art, AlphaFold2 has dramatically raised accuracy without performing explicit co-evolutionary analysis. Nevertheless, its performance still depends strongly on the available sequence homologs. Based on an investigation into the cause of this dependence, we present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs. By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime and even achieve encouraging performance with single-sequence predictions. Being able to make accurate predictions with few-shot MSAs not only generalizes AlphaFold2 better to orphan sequences, but also democratizes its use for high-throughput applications. In addition, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that can explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation will benefit other related tasks, including protein design.
Machine Learning, Artificial Intelligence, Biological Physics
What problem does this paper attempt to address?
The paper addresses a key issue in protein structure prediction (PSP): how to keep predictions accurate when few or no homologous sequences are available. Specifically, it focuses on the performance degradation of AlphaFold2 (AF2) when the multiple sequence alignment (MSA) for a target is shallow, that is, when it contains only a small number of homologs. This dependence sits uneasily with Anfinsen's hypothesis that a protein's structure is determined by its sequence.
To tackle this issue, the research team developed a meta generative model called EvoGen, which enhances AF2's predictive capability in the low-data regime by calibrating existing homologous sequences or generating virtual ones. The core of EvoGen is deep probabilistic learning applied to MSA generation. It extracts folding-related patterns from MSA datasets through unsupervised learning and can calibrate an existing MSA or generate a new one to improve the folding landscape that AF2 sees. In this way, EvoGen not only helps AF2 predict protein structures when the MSA is limited but also allows exploration of multiple conformations of a protein sequence, which is significant for understanding the functional diversity of proteins and for designing new proteins.
The paper also discusses how to use EvoGen-generated MSAs to stabilize folding in few-shot scenarios by directly "creating" virtual MSA patterns that form a smoother folding landscape. Through end-to-end differentiable training with AF2, EvoGen learns to generate MSAs that favor the stability of the target structure, thereby improving AF2's prediction accuracy when natural sequence information is absent.
In summary, the goal of this research is to overcome the limitations of existing protein structure prediction methods on orphan sequences and in data-scarce scenarios, using the EvoGen model to improve AF2's generalization and its ability to explore diverse protein structures.
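To make the idea of augmenting a shallow MSA with virtual homologs concrete, here is a deliberately simplified Python sketch. It is not EvoGen: the paper describes a deep generative model trained end to end with AF2, whereas this toy assumes a position-independent residue profile and samples from it. The function names (`column_profile`, `sample_virtual_homologs`), the `keep_query_prob` parameter, and the example sequences are all hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def column_profile(msa):
    """Return per-column residue frequencies for an aligned MSA.

    `msa` is a list of equal-length strings; the first one is the query.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences in an MSA must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile


def sample_virtual_homologs(query, profile, n, keep_query_prob=0.5, seed=0):
    """Sample `n` toy 'virtual homologs' of `query`.

    At each position the query residue is kept with probability
    `keep_query_prob`; otherwise a residue is drawn from the column
    profile. This is a stand-in for a learned generative model, not
    the method used in the paper.
    """
    rng = random.Random(seed)
    homologs = []
    for _ in range(n):
        chars = []
        for q, col in zip(query, profile):
            if rng.random() < keep_query_prob:
                chars.append(q)
            else:
                residues = list(col)
                weights = [col[r] for r in residues]
                chars.append(rng.choices(residues, weights=weights, k=1)[0])
        homologs.append("".join(chars))
    return homologs


# A hypothetical three-sequence ("few-shot") MSA; the first row is the query.
shallow_msa = ["MKTAYIAK", "MKSAYLAK", "MRTAYIGK"]
profile = column_profile(shallow_msa)
augmented = [shallow_msa[0]] + sample_virtual_homologs(shallow_msa[0], profile, n=5)
```

In this sketch the augmented alignment keeps the query as its first row and appends five sampled sequences. A real pipeline would pass such an alignment to the structure predictor; that step is omitted here because it depends on the AF2 setup in use.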