Tertiary-interaction characters enable fast, model-based structural phylogenetics beyond the twilight zone
Caroline Puente-Lelievre, Ashar J. Malik, Jordan Douglas, David Ascher, Matthew Baker, Jane Allison, Anthony Poole, Daniel Lundin, Matthew Fullmer, Remco Bouckert, Hyunbin Kim, Martin Steinegger, Nicholas Matzke
DOI: https://doi.org/10.1101/2023.12.12.571181
2024-01-09
Abstract: Protein structure is more conserved than protein sequence, and therefore may be useful for phylogenetic inference beyond the “twilight zone” where sequence similarity is highly decayed. Until recently, structural phylogenetics was constrained by the lack of solved structures for most proteins, and the reliance on phylogenetic distance methods which made it difficult to treat inference and uncertainty statistically. AlphaFold has mostly overcome the first problem by making structural predictions readily available. We address the second problem by redeploying a structural alphabet recently developed for Foldseek, a highly-efficient deep homology search program. For each residue in a structure, Foldseek identifies a tertiary interaction closest-neighbor residue in the structure, and classifies it into one of twenty “3Di” states. We test the hypothesis that 3Dis can be used as standard phylogenetic characters using a dataset of 53 structures from the ferritin-like superfamily. We performed 60 IQtree Maximum Likelihood runs to compare structure-free, PDB, and AlphaFold analyses, and default versus custom model sets that include a 3Di-specific rate matrix. Analyses that combine amino acids, 3Di characters, partitioning, and custom models produce the closest match to the structural distances tree, avoiding the long-branch attraction errors of structure-free analyses. Analyses include standard ultrafast bootstrapping confidence measures, and take minutes instead of weeks to run on desktop computers. These results suggest that structural phylogenetics could soon be routine practice in protein phylogenetics, allowing the re-exploration of many fundamental phylogenetic problems.
Evolutionary Biology
What problem does this paper attempt to address?
The main objective of this paper is to explore how to utilize protein structure information to improve phylogenetic analysis, especially in cases where sequence similarity is very low (the so-called "twilight zone"). Specifically, the authors attempt to address the following core issues:
1. **Integration of Structural Information**: How to effectively integrate protein structure information into phylogenetic analysis to enhance the ability to detect homology and resolve relationships under conditions of low sequence similarity.
2. **Statistical Modeling and Uncertainty Assessment**: How to move structural phylogenetics from distance methods to model-based inference, so that uncertainty in the resulting trees can be assessed statistically (for example, with bootstrap support).
3. **Computational Efficiency**: How to keep structural phylogenetic analysis fast enough for routine use, running in minutes rather than weeks on a desktop computer.
To address these issues, the researchers use "3Di" states, a structural alphabet of 20 discrete states originally developed for Foldseek. For each residue in a structure, Foldseek identifies the closest tertiary-interaction neighbor residue and classifies that interaction into one of the 20 states, which aim to capture conserved features of protein structure. By treating 3Di states as standard phylogenetic characters, the researchers incorporate them into existing model-based phylogenetic workflows, including maximum likelihood inference and ultrafast bootstrapping.
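Because 3Di states form a 20-letter alphabet, a 3Di alignment can be handled like any other character matrix and placed next to the amino acid alignment as a second partition. The following is a minimal sketch of that idea, not the authors' pipeline: the function names, the toy sequences, and the charset names `aa` and `tdi` are all hypothetical, and it assumes the two alignments are already aligned and cover the same taxa. It concatenates the two alignments and writes a NEXUS `sets` block describing the two partitions.

```python
def concatenate_alignments(aa, tdi):
    """Join amino acid and 3Di rows per taxon.

    aa, tdi: dicts mapping taxon name -> aligned sequence string.
    Returns (combined alignment, amino acid length, 3Di length).
    """
    if set(aa) != set(tdi):
        raise ValueError("both alignments must contain the same taxa")
    aa_len = len(next(iter(aa.values())))
    tdi_len = len(next(iter(tdi.values())))
    if any(len(seq) != aa_len for seq in aa.values()):
        raise ValueError("amino acid rows differ in length")
    if any(len(seq) != tdi_len for seq in tdi.values()):
        raise ValueError("3Di rows differ in length")
    combined = {name: aa[name] + tdi[name] for name in aa}
    return combined, aa_len, tdi_len


def nexus_partitions(aa_len, tdi_len):
    """Describe the two column ranges as NEXUS charsets (1-based, inclusive)."""
    lines = [
        "#nexus",
        "begin sets;",
        f"    charset aa = 1-{aa_len};",
        f"    charset tdi = {aa_len + 1}-{aa_len + tdi_len};",
        "end;",
    ]
    return "\n".join(lines) + "\n"


# Toy data (hypothetical): two taxa, five aligned columns per data type.
aa_alignment = {"taxonA": "MKV-L", "taxonB": "MRVAL"}
tdi_alignment = {"taxonA": "DPV-V", "taxonB": "DPLVV"}

combined, aa_len, tdi_len = concatenate_alignments(aa_alignment, tdi_alignment)
partition_text = nexus_partitions(aa_len, tdi_len)

for name, seq in combined.items():
    print(f">{name}\n{seq}")
print(partition_text)
```

A partition file of this shape can be supplied to IQ-TREE 2 through its `-p` option alongside the concatenated alignment. The specific substitution models assigned to each partition, including the 3Di-specific rate matrix, are described in the paper and are not reproduced here.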
The experiments used 53 structures from the ferritin-like superfamily as the test dataset, with both experimentally determined PDB structures and AlphaFold-predicted structures. Across 60 IQtree maximum likelihood runs, the researchers compared configurations (e.g., using only amino acid sequences, using only 3Di states, or using both, with default or custom model sets) to evaluate the role of structural information in phylogenetic reconstruction.
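In summary, this paper aims to overcome the limitations of traditional sequence-based phylogenetic analysis in the "twilight zone" by introducing 3Di states as phylogenetic characters. In the ferritin-like test case, analyses that combined amino acids, 3Di characters, partitioning, and custom models most closely matched the structural distances tree and avoided the long-branch attraction errors seen in structure-free analyses.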