Deep Generative Models of Protein Structure Uncover Distant Relationships Across a Continuous Fold Space
Eli J. Draizen, Stella Veretnik, Cameron Mura, Philip E. Bourne
DOI: https://doi.org/10.1101/2022.07.29.501943
2024-05-11
Abstract: Our views of fold space implicitly rest upon many assumptions that impact how we analyze, interpret and understand biological systems, from protein structure comparison and classification to function prediction and evolutionary analyses. For instance, is there an optimal granularity at which to view protein structural similarities (e.g., architecture, topology or some other level)? If so, how does it vary with the type of question being asked? Similarly, the discrete/continuous dichotomy of fold space is central in structural bioinformatics, but remains unresolved. Discrete views of fold space bin 'similar' folds into distinct, non-overlapping groups; unfortunately, such binning may inherently miss many remote relationships. While hierarchical systems like CATH, SCOP and ECOD represent major steps forward in protein classification, a scalable, objective and conceptually flexible method, with less reliance on assumptions and heuristics, could enable a more systematic and nuanced exploration of fold space, particularly as regards evolutionarily-distant relationships. Building upon a recent 'Urfold' model of protein structure, we have developed a new approach to analyze protein interrelationships. This framework, termed 'DeepUrfold', is rooted in deep generative modeling via variational Bayesian inference, and we find it to be useful for comparative analysis across the protein universe. Critically, DeepUrfold leverages its deep generative model's learned embeddings, which occupy high-dimensional latent spaces and can be distilled for a given protein in terms of an amalgamated representation that unites sequence, structure, biophysical and phylogenetic properties. Notably, DeepUrfold is structure-guided, versus being purely structure-based, and its architecture allows each trained model to learn protein features (structural and otherwise) that, in a sense, 'define' different superfamilies.
Deploying DeepUrfold with CATH suggests a new, mostly-continuous view of fold space—a view that extends beyond simple 3D structural/geometric similarity, towards the realm of sequence ↔ structure ↔ function properties. We find that such an approach can quantitatively represent and detect evolutionarily-remote relationships that evade existing methods.
Bioinformatics