Evolutionary Trees and the Ising Model on the Bethe Lattice: a Proof of Steel's Conjecture

Constantinos Daskalakis, Elchanan Mossel, Sebastien Roch
DOI: https://doi.org/10.48550/arXiv.math/0509575
2009-07-28
Abstract: A major task of evolutionary biology is the reconstruction of phylogenetic trees from molecular data. The evolutionary model is given by a Markov chain on a tree. Given samples from the leaves of the Markov chain, the goal is to reconstruct the leaf-labelled tree. It is well known that in order to reconstruct a tree on $n$ leaves, sample sequences of length $\Omega(\log n)$ are needed. It was conjectured by M. Steel that for the CFN/Ising evolutionary model, if the mutation probability on all edges of the tree is less than $p^{\ast} = (\sqrt{2}-1)/2^{3/2}$, then the tree can be recovered from sequences of length $O(\log n)$. The value $p^{\ast}$ is given by the transition point for the extremality of the free Gibbs measure for the Ising model on the binary tree. Steel's conjecture was proven by the second author in the special case where the tree is "balanced." The second author also proved that if all edges have mutation probability larger than $p^{\ast}$ then the length needed is $n^{\Omega(1)}$. Here we show that Steel's conjecture holds true for general trees by giving a reconstruction algorithm that recovers the tree from $O(\log n)$-length sequences when the mutation probabilities are discretized and less than $p^\ast$. Our proof and results demonstrate that extremality of the free Gibbs measure on the infinite binary tree, which has been studied before in probability, statistical physics and computer science, determines how distinguishable are Gibbs measures on finite binary trees.
Probability; Computational Engineering, Finance, and Science; Data Structures and Algorithms; Classical Analysis and ODEs; Combinatorics; Statistics Theory; Populations and Evolution
What problem does this paper attempt to address?
This paper addresses a problem in phylogenetic tree reconstruction: it proves Steel's conjecture for general trees. The conjecture states that if the mutation probability on every edge is less than a critical value \(p^*\), an evolutionary tree with \(n\) leaves can be reconstructed from sequences of length \(O(\log n)\).

### Specific Problem Description:

1. **Phylogenetic Tree Reconstruction**:
   - An important task in evolutionary biology is to reconstruct phylogenetic trees from molecular data. The evolutionary model is given by a Markov chain on the tree; given sample sequences at the leaves, the goal is to reconstruct the leaf-labelled tree.
   - It is known that reconstructing a tree with \(n\) leaves requires sample sequences of length \(\Omega(\log n)\).
2. **Steel's Conjecture**:
   - M. Steel conjectured that for the CFN/Ising model, if the mutation probability on all edges of the tree is less than \(p^* = (\sqrt{2} - 1)/2^{3/2}\), then the tree can be reconstructed from sequences of length \(O(\log n)\).
   - \(p^*\) is the transition point for the extremality of the free Gibbs measure of the Ising model on the binary tree.
3. **Existing Progress**:
   - Steel's conjecture was proven by the second author (Mossel) in the special case where the tree is "balanced" (i.e., all leaves are at the same distance from the root).
   - The second author also proved that if the mutation probability on all edges is greater than \(p^*\), then the required sequence length is \(n^{\Omega(1)}\).
4. **Contributions of This Paper**:
   - This paper proves that Steel's conjecture also holds for general trees by presenting a reconstruction algorithm: when the mutation probabilities are discretized and less than \(p^*\), the tree can be recovered from sequences of length \(O(\log n)\).
   - The results show that extremality of the free Gibbs measure of the Ising model on the infinite binary tree determines how distinguishable Gibbs measures on finite binary trees are.

### Key Formulas and Concepts:

- **Critical Mutation Probability \(p^*\)**:
  \[ p^* = \frac{\sqrt{2} - 1}{2^{3/2}} \approx 0.1464 \]
- **Path Metric \(d(v, w)\)**, where \(d(e)\) is the length assigned to edge \(e\):
  \[ d(v, w) = \sum_{e \in \text{path}_T(v, w)} d(e) \]
- **Transition Matrix \(M_e\)**:
  \[ M_e = \exp(d(e) Q) \]
  For the CFN model:
  \[ Q = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix} \]
  For the Jukes-Cantor model (off-diagonal entries \(1\), diagonal entries \(-3\)):
  \[ Q_{i,j} = 1 - 4 \cdot 1\{i = j\} \]

Using these concepts, the paper shows how to reconstruct phylogenetic trees from sequences of length \(O(\log n)\) and proves Steel's conjecture for general trees with discretized mutation probabilities below \(p^*\).
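
The formulas above can be checked numerically. The sketch below is only an illustration and is not taken from the paper: it assumes the closed form of \(\exp(dQ)\) for the 2×2 CFN rate matrix, whose off-diagonal entry (the mutation probability along an edge of length \(d\)) is \((1 - e^{-2d})/2\). All function names (`flip_probability`, `cfn_transition`, `edge_length`, `matmul2`, `path_length`) and the toy edge lengths are hypothetical. It evaluates \(p^* \approx 0.1464\), confirms the algebraic identity \(2(1 - 2p^*)^2 = 1\), and shows that multiplying transition matrices along a path matches the transition matrix of the summed path metric.

```python
import math

# Critical mutation probability quoted in the summary: p* = (sqrt(2) - 1) / 2^(3/2).
P_STAR = (math.sqrt(2) - 1) / 2 ** 1.5


def flip_probability(d):
    """Off-diagonal entry of exp(d * Q) for the CFN rate matrix Q = [[-1, 1], [1, -1]]."""
    return (1 - math.exp(-2 * d)) / 2


def cfn_transition(d):
    """Closed form of M_e = exp(d * Q), returned as a 2x2 nested list."""
    p = flip_probability(d)
    return [[1 - p, p], [p, 1 - p]]


def edge_length(p):
    """Inverse of flip_probability: the length d(e) giving mutation probability p < 1/2."""
    return -0.5 * math.log(1 - 2 * p)


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def path_length(edge_lengths, path):
    """Path metric d(v, w): sum of d(e) over the edges on the path from v to w."""
    return sum(edge_lengths[e] for e in path)


if __name__ == "__main__":
    theta = 1 - 2 * P_STAR
    print(f"p* = {P_STAR:.6f}")
    print(f"2 * (1 - 2p*)^2 = {2 * theta ** 2:.6f}")
    print(f"edge length at p* = {edge_length(P_STAR):.6f}")

    # Hypothetical two-edge path a - u - b with illustrative edge lengths.
    lengths = {("a", "u"): 0.05, ("u", "b"): 0.10}
    path = [("a", "u"), ("u", "b")]
    composed = matmul2(cfn_transition(lengths[path[0]]),
                       cfn_transition(lengths[path[1]]))
    direct = cfn_transition(path_length(lengths, path))
    print("composed along path:", composed)
    print("from path metric:   ", direct)
```

Because the path metric is additive, the two printed matrices agree up to floating-point error. For rate matrices without a convenient closed form, a general matrix exponential routine could be substituted for `cfn_transition`.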