Retrieval Augmented Protein Language Models for Protein Structure Prediction

Pan Li, Xingyi Cheng, Le Song, Eric Xing
DOI: https://doi.org/10.1101/2024.12.02.626519
2024-12-05
Abstract: The advent of advanced artificial intelligence technology has significantly accelerated progress in protein structure prediction. AlphaFold2, a pioneering method in this field, has set a new benchmark for prediction accuracy by leveraging the Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSA). However, the efficacy of structure prediction methods like AlphaFold2 is heavily dependent on the depth and quality of the MSA. To address this limitation, we propose two novel models, AIDO.RAGPLM and AIDO.RAGFold, which are pretrained modules for Retrieval-AuGmented protein language model and structure prediction in an AI-driven Digital Organism [ ]. AIDO.RAGPLM integrates pre-trained protein language models with retrieved MSA, allowing for the incorporation of co-evolutionary information in structure prediction while compensating for insufficient MSA information through large-scale pretraining. Our method surpasses single-sequence protein language models in perplexity, contact prediction, and fitness prediction. We utilized AIDO.RAGPLM as the feature extractor for protein structure prediction, resulting in the development of AIDO.RAGFold. When sufficient MSA is available, AIDO.RAGFold achieves TM-scores comparable to AlphaFold2 and operates up to eight times faster. In scenarios where MSA is insufficient, our method significantly outperforms AlphaFold2 (ΔTM-score=0.379, 0.116 and 0.059 for 0, 5 and 10 MSA sequences as input). Additionally, we developed an MSA retriever for MSA searching from the UniClust30 database using hierarchical ID generation, which is 45 to 90 times faster than traditional methods, and is used to expand the MSA training set for AIDO.RAGPLM by 32%. Our findings suggest that AIDO.RAGPLM provides an efficient and accurate solution for protein structure prediction.
Bioinformatics
What problem does this paper attempt to address?