Abstract: Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions.

What problem does this paper attempt to address?

This paper explores whether protein language models (pLMs) learn evolutionary relationships from protein sequences without being explicitly trained to do so. The researchers evaluated several pLMs (ESM2, ProtTrans, and MSA-Transformer) across multiple datasets and compared them with classical phylogenetic methods. The main questions are:

1. **Can protein language models capture classical evolutionary relationships?** The researchers compared the embedding vectors generated by pLMs with the evolutionary distance matrices inferred by classical phylogenetic methods (such as maximum-likelihood and Bayesian inference) to test whether pLMs accurately capture the relationships between protein sequences.
2. **How do insertions and deletions (indels) affect pLMs?** The researchers examined pLM performance on low-gap and high-gap datasets. They found that pLMs agree more closely with classical methods for protein families with fewer indels.
3. **How do different types of pLMs differ in capturing evolutionary relationships?** The researchers compared single-sequence pLMs (such as ESM2) with multi-sequence pLMs (such as MSA-Transformer). ESM2 performs best in most cases, especially on highly divergent sequences.
4. **How do pLMs perform at different evolutionary distances?** The researchers evaluated pLMs across a range of evolutionary distances and found that ESM2 performs particularly well on remotely homologous sequences.
5. **What role do internal neurons play?** The researchers examined how internal neurons encode evolutionary information. A small fraction of neurons (less than 10%) is sufficient to approximately reproduce classical phylogenetic distances, and these neurons are shared among different protein families but never fully overlap.

### Main conclusions

- **ESM2 performs best**: In most cases, ESM2 is superior to other pLMs in capturing evolutionary relationships, including for families with many insertions and deletions.
- **The impact of indels**: pLMs are more reliable for protein families with fewer indels.
- **Differences between early and late layers**: The early layers of single-sequence pLMs mainly learn basic characteristics such as amino acid composition, while the later layers focus more on capturing specific evolutionary relationships.
- **Complementarity**: pLMs can serve as a complementary tool for classical phylogenetic analysis, especially for remotely homologous sequences that are difficult to align.

By studying these questions, the authors aim to reveal the similarities and differences between pLMs and classical phylogenetic methods and to explore their potential applications in bioinformatics.
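
The comparison in item 1 and the neuron-subset finding in item 5 can be illustrated with a minimal sketch. It is not the authors' pipeline: the summary does not state which distance measure or agreement statistic the paper uses, so Euclidean distance between embeddings and Pearson correlation are assumptions here, as are the function names and the toy data. Real use would substitute per-sequence pLM embeddings and a distance matrix derived from an inferred tree.

```python
import numpy as np


def pairwise_euclidean(embeddings):
    """Return the (n, n) matrix of Euclidean distances between rows."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def upper_triangle(matrix):
    """Flatten the strict upper triangle so each sequence pair is counted once."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]


def distance_agreement(embeddings, phylo_dist, dims=None):
    """Pearson correlation between embedding and phylogenetic distances.

    dims optionally restricts the embedding to a subset of dimensions
    ("neurons"), mimicking the idea of using a small fraction in isolation.
    """
    if dims is not None:
        embeddings = embeddings[:, dims]
    emb = upper_triangle(pairwise_euclidean(embeddings))
    phy = upper_triangle(phylo_dist)
    return float(np.corrcoef(emb, phy)[0, 1])


# Toy data (hypothetical): four sequences placed on a line, so the
# "phylogenetic" distance is simply the absolute difference in position.
rng = np.random.default_rng(0)
positions = np.array([0.0, 1.0, 3.0, 7.0])
phylo_dist = np.abs(positions[:, None] - positions[None, :])

# Two informative dimensions that track position, plus eight noise dimensions.
signal = positions[:, None] * np.ones((1, 2))
noise = rng.normal(size=(4, 8))
embeddings = np.hstack([signal, noise])

r_all = distance_agreement(embeddings, phylo_dist)
r_subset = distance_agreement(embeddings, phylo_dist, dims=[0, 1])
```

In this toy setup the two informative dimensions reproduce the reference distances up to a constant factor, so restricting to them gives a correlation of essentially 1, while including the noise dimensions can only dilute the agreement. That mirrors, in a very simplified way, the summary's point that a small subset of neurons used in isolation narrows the gap between the two paradigms.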

Do Protein Language Models Learn Phylogeny?

Protein language models learn evolutionary statistics of interacting sequence motifs

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

Learning the Language of Phylogeny with MSA Transformer

PEvoLM: Protein Sequence Evolutionary Information Language Model

Learning the protein language: Evolution, structure, and function

Exploring evolution-aware & -free protein language models as protein function predictors

Assessing the role of evolutionary information for enhancing protein language model embeddings

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Long-context Protein Language Model

Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks

InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Protein language models are biased by unequal sequence sampling across the tree of life

Protein language model pseudolikelihoods capture features of in vivo B cell selection and evolution

Genomic language model predicts protein co-regulation and function

From PSSM to Pre-Trained Language Models

Modeling Protein Using Large-scale Pretrain Language Model

Protein language models meet reduced amino acid alphabets

Protein language model embeddings for fast, accurate, alignment-free protein structure prediction