Sequence alignment using large protein structure alphabets improves sensitivity to remote homologs

Robert C Edgar
DOI: https://doi.org/10.1101/2024.05.24.595840
2024-06-09
Abstract: Recent breakthroughs in protein fold prediction from amino acid sequences have unleashed a deluge of new structures, raising new opportunities for expanding insights into the universe of proteins and pursuing practical applications in bio-engineering and therapeutics while also presenting new challenges to protein search and analysis algorithms. Here, I describe Reseek, a protein alignment algorithm which improves sensitivity in protein homolog detection compared to state-of-the-art methods including DALI, TM-align and Foldseek, with improved speed over Foldseek, the fastest previous method. Reseek is based on alignment of sequences where each residue in the protein backbone is represented by a letter in a novel "mega-alphabet" of 85,899,345,920 (∼10^11) distinct states. Code is available at https://github.com/rcedgar/reseek.
Bioinformatics