Learning the Language of Phylogeny with MSA Transformer

Ruyi Chen, Gabriel Foley, Mikael Boden
DOI: https://doi.org/10.1101/2024.12.18.629037
2024-12-21
Abstract: Classical phylogenetic inference assumes independence between sites, potentially undermining the accuracy of evolutionary analyses in the presence of epistasis. Some protein language models have the capacity to encode dependencies between sites in conserved structural and functional domains across the protein universe. We employ the MSA Transformer, which takes a multiple sequence alignment (MSA) as an input, and is trained with masked language modeling objectives, to investigate if and how effects of epistasis can be captured to enhance the analysis of phylogenetic relationships. We test whether the MSA Transformer internally encodes evolutionary distances between the sequences in the MSA despite this information not being explicitly available during training. We investigate the model's reliance on information available in columns as opposed to rows in the MSA, by systematically shuffling sequence content. We then use MSA Transformer on both natural and simulated MSAs to reconstruct entire phylogenetic trees with implied ancestral branchpoints, and assess their consistency with trees from maximum likelihood inference. We demonstrate how both previously known and novel evolutionary relationships are available from a "non-classical" approach with very different computational requirements, by reconstructing phylogenetic trees for the RNA virus RNA-dependent RNA polymerase and the nucleo-cytoplasmic large DNA virus domain. We anticipate that MSA Transformer will not replace but rather complement classical phylogenetic inference, to accurately recover the evolutionary history of protein families.
Biology