Muscle-3D: scalable multiple protein structure alignment

Robert C Edgar, Igor Tolstoy
DOI: https://doi.org/10.1101/2024.10.26.620413
2024-10-28
Abstract: Protein multiple alignment is an essential step in many bioinformatics analyses such as phylogenetic tree estimation, HMM construction and critical residue identification. Structure is conserved between distantly-related proteins where amino acid similarity is weak or undetectable, suggesting that structure-informed sequence alignments might offer advantages over alignments constructed from amino acid sequences alone. The advent of the AI folding era has unleashed millions of high-quality predicted structures, motivating the development and assessment of scalable multiple structure alignment (MStA) methods. Here, we describe Muscle-3D, a new MStA algorithm combining a rich sequence representation of structure context, the Reseek "mega-alphabet", with state-of-the-art alignment techniques from Muscle5 including a posterior decoding pair-HMM, consistency transformation, iterative refinement and ensemble construction. We show that Muscle-3D readily scales to thousands of structures. Comparative validation on several benchmark datasets using different quality metrics shows Muscle-3D to be among the higher-scoring methods, but we find that algorithm rankings from different metrics disagree despite low P-values according to the Wilcoxon rank-sum test. We suggest that these conflicts arise from the inherently fuzzy nature of structural alignment, and argue that a universal standard of MStA accuracy is not possible in principle. We describe contact map profiles for visualizing variation in inter-residue distances, and introduce a novel measure of local conformation similarity, LDDT-muw. Muscle-3D software is available at https://github.com/rcedgar/muscle.
Bioinformatics