Large protein databases reveal structural complementarity and functional locality

Pawel Szczerbiak, Lukasz M. Szydlowski, Witold Wydmanski, P. Douglas Renfrew, Julia Koehler Leman, Tomasz Kosciolek
DOI: https://doi.org/10.1101/2024.08.14.607935
2024-10-16
Abstract: Recent breakthroughs in protein structure prediction have led to an unprecedented surge in high-quality 3D models, highlighting the need for efficient computational solutions to manage and analyze this wealth of structural data. In our work, we comprehensively examine the structural clusters obtained from the AlphaFold Protein Structure Database (AFDB), a high-quality subset of ESMAtlas, and the Microbiome Immunity Project (MIP). We create a single cohesive low-dimensional representation of the resulting protein space. Our results show that, while each database occupies distinct regions within the protein structure space, they collectively exhibit significant overlap in their functional profiles. High-level biological functions tend to cluster in particular regions, revealing a shared functional landscape despite the diverse sources of data. To facilitate exploration and improve access to our data, we developed an open-access web server. Our findings lay the groundwork for more in-depth studies concerning protein sequence-structure-function relationships, where various biological questions can be asked about taxonomic assignments, environmental factors, or functional specificity.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to analyze the relationship between structure and function across today's large protein structure databases. The authors combine structural clusters from three sources, the AlphaFold Protein Structure Database (AFDB), a high-quality subset of ESMAtlas, and the Microbiome Immunity Project (MIP), into a single low-dimensional representation of protein structure space. This representation shows where the data sources complement one another and where they overlap, and it reveals that high-level biological functions concentrate in specific regions of the space, a pattern the paper calls "functional locality". The paper also provides an open-access web server for exploring these structural data sets.

The key points of the paper include:

1. **Exploration of structural space**: The authors analyze protein structures from the different databases, examining how they are distributed in structural space and how those distributions reflect protein function.
2. **Functional locality**: High-level biological functions tend to cluster in particular regions of the structural space, so function follows a recognizable pattern even though the data come from diverse sources.
3. **Complementarity of databases**: Each database occupies distinct regions of the structural space, yet their functional profiles overlap significantly. The databases are therefore structurally complementary while sharing a common functional landscape.
4. **Tools and methods**: The analysis uses DeepFRI for functional annotation and PaCMAP for dimensionality reduction and visualization.

Together, these results lay the groundwork for more in-depth studies of protein sequence-structure-function relationships and provide new perspectives and tools for future biological research.
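To make the idea of functional locality concrete, the sketch below shows one simple way such a pattern could be quantified in a 2D embedding. It is an illustration under stated assumptions, not the paper's method: the functions `knn_label_agreement` and `chance_agreement`, the three function labels, and the synthetic coordinates are all hypothetical. In a real analysis the coordinates would come from a dimensionality reduction such as PaCMAP and the labels from a functional annotation tool such as DeepFRI.

```python
import math
import random
from collections import Counter


def knn_label_agreement(points, labels, k=5):
    """Mean fraction of each point's k nearest neighbours sharing its label.

    Hypothetical locality score: values well above chance suggest that
    labels are spatially clustered in the embedding.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        # Brute-force distances from point i to every other point, O(n^2).
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / n


def chance_agreement(labels):
    """Expected agreement if labels were unrelated to position.

    This is the probability that two distinct, randomly chosen points
    carry the same label.
    """
    n = len(labels)
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Synthetic stand-in for a 2D embedding (assumption: three well-separated
# functional groups; real embeddings are far noisier).
rng = random.Random(0)
centres = {"transport": (0.0, 0.0), "catalysis": (5.0, 5.0), "binding": (0.0, 5.0)}
points, labels = [], []
for name, (cx, cy) in centres.items():
    for _ in range(30):
        points.append((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)))
        labels.append(name)

observed = knn_label_agreement(points, labels, k=5)
baseline = chance_agreement(labels)
print(f"observed={observed:.2f} baseline={baseline:.2f}")
```

An observed score far above the chance baseline indicates that points with the same label sit close together. The brute-force neighbour search is only practical for small samples; data sets at the scale of the databases discussed here would need an approximate nearest-neighbour index.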