Fast protein structure searching using structure graph embeddings

Joe G Greener, Kiarash Jamali
DOI: https://doi.org/10.1101/2022.11.28.518224
2024-04-18
Abstract: Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation and protein classification. Fast and accurate methods to search with structures will be essential to make use of the vast databases that have recently become available, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network using supervised contrastive learning to learn a low-dimensional embedding of protein structure. The method, called Progres, is available at . It has accuracy comparable to the best current methods and can search the AlphaFold database TED domains in a tenth of a second per query on CPU.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the problem of searching large protein structure databases quickly. Existing structure comparison methods are accurate but slow at that scale. The authors train a simple graph neural network (GNN) with supervised contrastive learning to learn a low-dimensional embedding of protein structure; the resulting method is called Progres. Progres compares protein structures independently of their primary sequence, and it searches quickly because comparing two structures reduces to computing the cosine similarity of their embedding vectors. Its accuracy is comparable to that of the best current methods, such as Dali and Foldseek-TM, and it can search the TED domains of the AlphaFold database in about a tenth of a second per query structure on a CPU.
The paper first reviews existing approaches to protein structure comparison, then proposes learning structure embeddings with a GNN and supervised contrastive learning so that structures can be compared faster. The experiments show that Progres performs well at remote homology detection, protein classification and structure searching, particularly for all-beta domains, small domains and domains with high contact order. It is slightly weaker on membrane proteins and large domains, but overall it detects homology more effectively than sequence-based methods.
The paper also examines how the embedding dimension affects performance and reports a significant drop when the dimension falls below 32. Searching a single structure against a large database of precomputed embeddings is fast, especially when accelerated with FAISS.
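The search step described above, ranking database structures by the cosine similarity of their embeddings to a query embedding, can be illustrated with a minimal sketch. This is not the Progres implementation: it assumes the embeddings have already been computed, uses toy 4-dimensional vectors with hypothetical domain names, and does an exhaustive scan in plain Python instead of using FAISS. The function names `l2_normalize`, `build_index` and `search` are made up for this example.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings):
    """Normalize every database embedding once, ahead of any query."""
    return {name: l2_normalize(vec) for name, vec in embeddings.items()}


def search(index, query, top_k=3):
    """Return the top_k (name, cosine similarity) pairs, best match first."""
    q = l2_normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), name)
        for name, vec in index.items()
    )
    best = heapq.nlargest(top_k, scored)
    return [(name, score) for score, name in best]


# Toy embeddings with hypothetical names (assumption: real embeddings
# would come from the trained model and have more dimensions).
database = {
    "domain_a": [1.0, 0.0, 0.0, 0.0],
    "domain_b": [0.9, 0.1, 0.0, 0.0],
    "domain_c": [0.0, 1.0, 0.0, 0.0],
    "domain_d": [-1.0, 0.0, 0.0, 0.0],
}
index = build_index(database)
hits = search(index, [2.0, 0.0, 0.0, 0.0], top_k=2)
for name, score in hits:
    print(f"{name}\t{score:.4f}")
```

Because the database vectors are normalized once in advance, each query costs only one dot product per database entry. That is why searching precomputed embeddings is cheap compared with pairwise structural alignment, and it is the kind of inner-product search that a library such as FAISS is designed to accelerate.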
Finally, the authors emphasize the potential of Progres for applications such as protein structure searching, clustering and design, as well as the advantages of a low-dimensional representation of fold space when working with structure-rich databases.