Unsupervised domain classification of AlphaFold2-predicted protein structures

Federico Barone, Alessandro Laio, Marco Punta, Stefano Cozzini, Alessio Ansuini, Alberto Cazzaniga

DOI: https://doi.org/10.1101/2024.08.21.608992

2024-08-21

Abstract: The release of the AlphaFold database, which contains 214 million predicted protein structures, represents a major leap forward for proteomics and its applications. However, lack of comprehensive protein annotation limits its accessibility and usability. Here, we present DPCstruct, an unsupervised clustering algorithm designed to provide domain-level classification of protein structures. Using structural predictions from AlphaFold2 and comprehensive all-against-all local alignments from Foldseek, DPCstruct identifies and groups recurrent structural motifs into domain clusters. When applied to the Foldseek Cluster database, a representative set of proteins from AlphaFoldDB, DPCstruct successfully recovers the majority of protein folds catalogued in established databases such as SCOP and CATH. Out of the 28,246 clusters identified by DPCstruct, 24% have no structural or sequence similarity to known protein families. Supported by a modular and efficient implementation, classifying 15 million entries in less than 48 hours, DPCstruct is well suited for large-scale proteomics and metagenomics applications. It also facilitates the rapid incorporation of updates from the latest structural prediction tools, thus making sure that the classification is up-to-date. The DPCstruct pipeline and associated database are freely available in a dedicated repository, enhancing the navigation of the AlphaFoldDB through domain annotations and enabling rapid classification of other protein datasets.
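
To put the abstract's headline figures in absolute terms, the short calculation below converts the 24% share into an approximate cluster count and the 48-hour bound into a minimum throughput. It uses only the numbers quoted above. The variable names are illustrative, and because 24% is a rounded figure and the runtime is an upper bound, the results are approximations rather than values reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_CLUSTERS = 28_246      # clusters identified by DPCstruct
NOVEL_FRACTION = 0.24        # share with no similarity to known families (rounded)
ENTRIES = 15_000_000         # entries classified
MAX_HOURS = 48               # upper bound on runtime ("less than 48 hours")

# Approximate number of clusters without known structural or sequence similarity.
novel_clusters = round(TOTAL_CLUSTERS * NOVEL_FRACTION)

# Lower bound on average throughput, since the run took less than MAX_HOURS.
min_throughput = ENTRIES / (MAX_HOURS * 3600)

print(f"approx. novel clusters: {novel_clusters}")
print(f"throughput: at least {min_throughput:.1f} entries per second")
```

On these figures, roughly 6,800 clusters lack a match to known families, and the pipeline averages more than about 86 entries per second.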

Bioinformatics

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Large-scale classification of protein structures**: The AlphaFold database contains 214 million predicted protein structures, a significant advance for proteomics and its applications. However, the lack of comprehensive protein annotations limits its accessibility and usability.
2. **Domain-level classification**: Most current methods cluster entire proteins, whereas this paper proposes a new method, DPCstruct, that classifies proteins at the domain level. This approach helps identify the functional modules of proteins, which are often more conserved during evolution.
3. **Automated and efficient processing of large-scale data**: DPCstruct can classify large datasets (e.g., 15 million entries) in less than 48 hours, which makes it well suited for large-scale proteomics and metagenomics applications.
4. **Discovery of new protein families or fold types**: By identifying unannotated domains, DPCstruct has the potential to uncover new protein families or fold types, further enriching the annotation and functional prediction of the protein space.

In summary, the goal of this paper is to develop an efficient, automated tool for annotating and classifying large-scale protein structure data, particularly at the domain level, in support of research on protein function.
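
As a rough illustration of what "recurrent" regions in local alignments can mean, the toy sketch below counts how many alignments cover each residue of one protein and reports the contiguous stretches covered at least `min_hits` times. This is a deliberately simplified stand-in, not the DPCstruct algorithm: the function name `recurrent_regions`, the `(start, end)` input format, and the `min_hits` threshold are all assumptions made for the example.

```python
from typing import List, Tuple


def recurrent_regions(
    length: int,
    alignments: List[Tuple[int, int]],
    min_hits: int,
) -> List[Tuple[int, int]]:
    """Return contiguous regions covered by at least `min_hits` alignments.

    `alignments` holds hypothetical (start, end) residue ranges on one
    protein, 0-based with `end` exclusive.
    """
    # Count how many alignments cover each residue.
    coverage = [0] * length
    for start, end in alignments:
        for i in range(max(start, 0), min(end, length)):
            coverage[i] += 1

    # Collect maximal runs of residues whose coverage meets the threshold.
    regions: List[Tuple[int, int]] = []
    region_start = None
    for i, hits in enumerate(coverage):
        if hits >= min_hits and region_start is None:
            region_start = i
        elif hits < min_hits and region_start is not None:
            regions.append((region_start, i))
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions


# Toy protein of 20 residues with two groups of overlapping alignments.
toy_alignments = [(0, 8), (1, 9), (2, 10), (12, 20), (13, 19)]
print(recurrent_regions(20, toy_alignments, min_hits=2))
```

In this toy input, two separate stretches are each covered by at least two alignments, so the function reports two candidate regions. A real pipeline would additionally need to compare such regions across proteins and cluster them, which this sketch does not attempt.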

DomBpred: Protein Domain Boundary Prediction Based on Domain-Residue Clustering Using Inter-Residue Distance.

ECOD domain classification of 48 whole proteomes from AlphaFold Structure Database using DPAM2

Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models

Exploring structural diversity across the protein universe with The Encyclopedia of Domains

Large protein databases reveal structural complementarity and functional locality

AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences

Prediction of Protein Domain Folding Classes

Highly accurate protein structure prediction for the human proteome

AlphaFold predictions on whole genomes at a glance

Multi-domain and complex protein structure prediction using inter-domain interactions from deep learning

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models

CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2

Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier.

Generation of a high confidence set of domain-domain interface types to guide protein complex structure predictions by AlphaFold

DEMO2: Assemble Multi-Domain Protein Structures by Coupling Analogous Template Alignments with Deep-Learning Inter-Domain Restraint Prediction

Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions

Unmasking AlphaFold to integrate experiments and predictions in multimeric complexes