Unsupervised domain classification of AlphaFold2-predicted protein structures

Federico Barone, Alessandro Laio, Marco Punta, Stefano Cozzini, Alessio Ansuini, Alberto Cazzaniga
DOI: https://doi.org/10.1101/2024.08.21.608992
2024-08-21
Abstract: The release of the AlphaFold database, which contains 214 million predicted protein structures, represents a major leap forward for proteomics and its applications. However, lack of comprehensive protein annotation limits its accessibility and usability. Here, we present DPCstruct, an unsupervised clustering algorithm designed to provide domain-level classification of protein structures. Using structural predictions from AlphaFold2 and comprehensive all-against-all local alignments from Foldseek, DPCstruct identifies and groups recurrent structural motifs into domain clusters. When applied to the Foldseek Cluster database, a representative set of proteins from AlphaFoldDB, DPCstruct successfully recovers the majority of protein folds catalogued in established databases such as SCOP and CATH. Out of the 28,246 clusters identified by DPCstruct, 24% have no structural or sequence similarity to known protein families. Supported by a modular and efficient implementation, classifying 15 million entries in less than 48 hours, DPCstruct is well suited for large-scale proteomics and metagenomics applications. It also facilitates the rapid incorporation of updates from the latest structural prediction tools, thus making sure that the classification is up-to-date. The DPCstruct pipeline and associated database are freely available in a dedicated repository, enhancing the navigation of the AlphaFoldDB through domain annotations and enabling rapid classification of other protein datasets.
Bioinformatics
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Large-scale classification of protein structures**: The AlphaFold database contains 214 million predicted protein structures, a significant advance for proteomics and its applications. However, the lack of comprehensive protein annotation limits the database's accessibility and usability.
2. **Domain-level classification**: Most current methods cluster entire proteins, whereas this paper proposes a new method, DPCstruct, that classifies proteins at the domain level. This approach helps identify the functional modules of proteins, which are often more conserved during evolution.
3. **Automated and efficient processing of large-scale data**: DPCstruct classifies large datasets efficiently (15 million entries in less than 48 hours), which makes it well suited to large-scale proteomics and metagenomics applications.
4. **Discovery of new protein families or fold types**: By identifying unannotated domains, DPCstruct has the potential to discover new protein families or fold types. In the reported application, 24% of the 28,246 clusters show no structural or sequence similarity to known protein families, so the method can enrich the annotation and functional prediction of the protein space.

In summary, the goal of this paper is to develop an efficient, automated tool for annotating and classifying large-scale protein structure data, particularly at the domain level, in order to support research on protein function.
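
To put the abstract's headline figures in perspective, the following back-of-the-envelope sketch converts them into an approximate count of unannotated clusters and a minimum aggregate throughput. The function and constant names are illustrative and not part of DPCstruct. The results are rough: the 24% figure is rounded, "less than 48 hours" only gives a lower bound on the rate, and the abstract does not state the hardware used.

```python
# Figures quoted in the abstract (the only inputs taken from the source).
TOTAL_CLUSTERS = 28_246      # clusters identified by DPCstruct
NOVEL_FRACTION = 0.24        # share with no similarity to known families (rounded)
ENTRIES = 15_000_000         # entries classified
MAX_HOURS = 48               # upper bound on wall-clock time


def novel_cluster_estimate(total: int = TOTAL_CLUSTERS,
                           fraction: float = NOVEL_FRACTION) -> int:
    """Approximate number of clusters without similarity to known families."""
    return round(total * fraction)


def min_throughput_per_second(entries: int = ENTRIES,
                              hours: float = MAX_HOURS) -> float:
    """Lower bound on aggregate entries processed per second."""
    return entries / (hours * 3600)


if __name__ == "__main__":
    print(f"Approx. novel clusters: {novel_cluster_estimate()}")
    print(f"Min. throughput: {min_throughput_per_second():.1f} entries/s")
```

Under these assumptions, roughly 6,800 clusters would lack similarity to known families, and the pipeline would need to sustain on the order of 87 entries per second across whatever resources were used.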