Clustering protein functional families at large scale with hierarchical approaches

Nicola Bordin, Harry Scholes, Clemens Rauer, Joel Roca‐Martínez, Ian Sillitoe, Christine Orengo
DOI: https://doi.org/10.1002/pro.5140
IF: 8
2024-08-18
Protein Science
Abstract: Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large‐scale repositories such as MGnify or the AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH‐eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH‐eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new resource for future studies in protein function and evolution.
biochemistry & molecular biology