OpenProteinSet: Training data for structural biology at scale

Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed AlQuraishi
2023-08-10
Abstract: Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
The paper addresses the gap between the protein multiple sequence alignment (MSA) datasets that are publicly available and the scale of data that modern machine learning methods require. Specifically:

1. **Importance of MSAs**: MSAs encode rich functional and structural information and are widely used in bioinformatics for tasks such as protein design and protein structure prediction. Recent breakthroughs like AlphaFold2, which uses transformers to attend directly over large quantities of raw MSAs, have reaffirmed their importance.
2. **Limitations of existing datasets**: The internal dataset used to train AlphaFold2 contains millions of MSAs but has not been made public, and existing public MSA databases are smaller and out of date, so they cannot support large-scale machine learning research.
3. **High computational cost**: Generating high-quality MSAs is highly computationally intensive, so in practice only a few well-resourced research teams can produce them at scale, which hinders progress in machine learning for proteins.

To address these issues, the authors introduce OpenProteinSet, an open-source corpus of more than 16 million precomputed MSAs, together with associated structural homologs from the Protein Data Bank and AlphaFold2 protein structure predictions. The corpus includes MSAs for Protein Data Bank sequences as well as MSAs computed for the Uniclust30 database. The authors expect OpenProteinSet to serve as training and validation data for tasks focused on protein structure, function, and design, and for large-scale multimodal machine learning research.