Do Protein Language Models Learn Phylogeny?

Sanjana Tule, Gabriel Foley, Mikael Boden
DOI: https://doi.org/10.1101/2024.09.23.614642
2024-09-26
Abstract: Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions.
Molecular Biology
What problem does this paper attempt to address?
This paper explores whether protein language models (pLMs) learn evolutionary relationships from protein sequences without being explicitly trained to do so. The researchers evaluated several pLMs (ESM2, ProtTrans, and MSA-Transformer) on multiple datasets and compared them with classical evolutionary analysis methods. The main questions addressed are:

1. **Can protein language models capture classical evolutionary relationships?** The researchers tested this by comparing the embedding vectors generated by pLMs with the evolutionary distance matrices inferred by classical evolutionary models (such as maximum-likelihood methods and Bayesian statistics).
2. **The impact of insertions and deletions (indels) on pLMs.** The researchers examined how indels affect pLM performance, contrasting low-gap and high-gap datasets. They found that pLMs agree more closely with classical methods for protein families with fewer indels.
3. **Differences among types of pLMs in capturing evolutionary relationships.** The researchers compared single-sequence pLMs (such as ESM2) with multi-sequence pLMs (such as MSA-Transformer). ESM2 performs best in most cases, especially on highly divergent sequences.
4. **The performance of pLMs at different evolutionary distances.** The researchers evaluated how well pLMs capture relationships at different evolutionary distances. ESM2 performs particularly well on remotely homologous sequences.
5. **The role of internal neurons in pLMs.** The researchers examined how internal neurons encode evolutionary information. A small fraction of neurons (less than 10%) is sufficient to approximately reproduce classical evolutionary distances, and these neurons overlap among protein families without being identical.

### Main conclusions

- **ESM2 performs best**: In most cases, the largest ESM2 model outperforms other pLMs in capturing evolutionary relationships, in both low-gap and high-gap settings.
- **The impact of indels**: pLMs are more reliable for protein families with fewer indels.
- **Differences between early and late layers**: The early layers of single-sequence pLMs mainly learn basic characteristics such as amino acid composition, while the later layers focus more on capturing specific evolutionary relationships.
- **Complementarity**: pLMs can serve as a complementary tool for classical evolutionary analysis, especially for remotely homologous sequences that are difficult to align.

Through these analyses, the authors aim to reveal the similarities and differences between pLMs and classical evolutionary analysis methods and to explore their potential applications in bioinformatics.
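The comparison between embedding distances and classical evolutionary distances described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the choice of Euclidean distance and Pearson correlation, the function names `pairwise_euclidean` and `distance_agreement`, and the random toy data standing in for pLM embeddings and tree-derived distances. It is not the authors' pipeline, and the numbers it produces say nothing about the paper's results.

```python
import numpy as np


def pairwise_euclidean(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of Euclidean distances between row vectors."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_agreement(embeddings: np.ndarray, tree_dist: np.ndarray, dims=None) -> float:
    """Pearson correlation between embedding distances and reference distances.

    `dims` optionally restricts the embedding to a subset of dimensions
    ("neurons"); None uses all of them. Hypothetical helper, not from the paper.
    """
    emb = embeddings if dims is None else embeddings[:, dims]
    emb_dist = pairwise_euclidean(emb)
    upper = np.triu_indices(len(emb), k=1)  # count each unordered pair once
    return float(np.corrcoef(emb_dist[upper], tree_dist[upper])[0, 1])


# Toy data (assumption): 6 "sequences" with 40-dimensional "embeddings",
# where only the first 4 dimensions carry the distance signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=(6, 4))
noise = rng.normal(size=(6, 36))
embeddings = np.hstack([signal, noise])

# Stand-in for a classical distance matrix; a real one would come from a tree.
tree_dist = pairwise_euclidean(signal)

all_dims = distance_agreement(embeddings, tree_dist)
subset = distance_agreement(embeddings, tree_dist, dims=np.arange(4))
print(f"all dimensions: {all_dims:.3f}, selected subset: {subset:.3f}")
```

In this toy setup the subset correlates perfectly with the reference distances by construction, because the reference matrix was built from those same dimensions. With real data, the reference matrix would come from an inferred phylogeny, and selecting informative dimensions would itself be a modelling step.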