On the correctness of Maximum Parsimony for data with few substitutions in the NNI neighborhood of phylogenetic trees

Mareike Fischer
2024-10-01
Abstract: Estimating phylogenetic trees, which depict the relationships between different species, from aligned sequence data (such as DNA, RNA, or proteins) is one of the main aims of evolutionary biology. However, tree reconstruction criteria like maximum parsimony do not necessarily lead to unique trees and in some cases even fail to recognize the "correct" tree (i.e., the tree on which the data was generated). On the other hand, a recent study has shown that for an alignment containing precisely those binary characters (sites) which require up to two substitutions on a given tree, this tree will be the unique maximum parsimony tree. It is the aim of the present paper to generalize this recent result in the following sense: We show that for a tree $T$ with $n$ leaves, as long as $k<\frac{n}{8}+\frac{11}{9}-\frac{1}{18}\sqrt{9\cdot \left(\frac{n}{4}\right)^2+16}$ (or, equivalently, $n>9k-11+\sqrt{9k^2-22k+17}$, which in particular holds for all $n\geq 12k$), the maximum parsimony tree for the alignment containing all binary characters which require (up to or precisely) $k$ substitutions on $T$ will be unique in the NNI neighborhood of $T$ and it will coincide with $T$, too. In other words, within the NNI neighborhood of $T$, $T$ is the unique most parsimonious tree for the said alignment. This partially answers a recently published conjecture affirmatively. Additionally, we show that for $n\geq 8$ and for $k$ being in the order of $\frac{n}{2}$, there is always a pair of phylogenetic trees $T$ and $T'$ which are NNI neighbors, but for which the alignment of characters requiring precisely $k$ substitutions each on $T$ in total requires fewer substitutions on $T'$.
Populations and Evolution, Combinatorics
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper investigates whether the Maximum Parsimony (MP) method identifies a tree \( T \) as the unique most parsimonious tree within its Nearest Neighbor Interchange (NNI) neighborhood when the data consist of binary characters that require few substitutions on \( T \). Specifically:

1. **Problem background**:
   - Reconstructing phylogenetic trees is one of the main aims of evolutionary biology. However, Maximum Parsimony does not necessarily yield a unique tree, and in some cases it fails to recognize the tree on which the data was generated.
   - A recent study has shown that for an alignment containing precisely those binary characters (sites) that require up to two substitutions on a given tree, that tree is the unique maximum parsimony tree.
2. **Research objectives**:
   - The paper generalizes this result: for a tree \( T \) with \( n \) leaves, as long as \( k < \frac{n}{8}+\frac{11}{9}-\frac{1}{18}\sqrt{9\cdot\left(\frac{n}{4}\right)^2 + 16} \) (or, equivalently, \( n > 9k - 11+\sqrt{9k^2 - 22k + 17} \), which in particular holds for all \( n\geq 12k \)), the maximum parsimony tree for the alignment containing all binary characters that require (up to or precisely) \( k \) substitutions on \( T \) is unique within the NNI neighborhood of \( T \) and coincides with \( T \).
   - This partially answers a recently published conjecture affirmatively.
   - Conversely, for \( n\geq 8 \) and \( k \) in the order of \( \frac{n}{2} \), the paper shows that there is always a pair of NNI neighbors \( T \) and \( T' \) for which the alignment of characters requiring precisely \( k \) substitutions each on \( T \) requires fewer substitutions in total on \( T' \).
3. **Specific problems**:
   - The paper examines whether Maximum Parsimony uniquely recovers \( T \) within the NNI neighborhood of \( T \) when \( k \) is small relative to the number of leaves \( n \).
   - That is, it proves that if \( T' \) is an NNI neighbor of \( T \), then \( l(A_k(T), T) < l(A_k(T), T') \) when \( k \) is small enough.

### Formula explanations

- \( k \) is the number of substitutions (at most or exactly \( k \), depending on the variant) that each character in the alignment requires on \( T \).
- \( n \) is the number of leaves of tree \( T \).
- \( l(A_k(T), T) \) is the parsimony score of the alignment \( A_k(T) \) on tree \( T \), i.e., the minimum total number of substitutions needed to explain the alignment on \( T \).
- \( A_k(T) \) is the alignment consisting of all binary characters that require at most or exactly \( k \) substitutions on tree \( T \).

Under the stated condition on \( k \) and \( n \), the paper proves that Maximum Parsimony identifies \( T \) as the unique most parsimonious tree within its NNI neighborhood, which provides theoretical support for parsimony-based reconstruction of phylogenetic trees from data with few substitutions.
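
The condition on \( n \) and \( k \) from the abstract, \( n > 9k - 11 + \sqrt{9k^2 - 22k + 17} \), can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `bound_holds` and `smallest_n` are hypothetical, and the test is rewritten in integer arithmetic (squaring both sides once the left-hand side is known to be positive) to avoid floating-point rounding at the boundary.

```python
def bound_holds(n: int, k: int) -> bool:
    """Exact integer test of n > 9k - 11 + sqrt(9k^2 - 22k + 17).

    Hypothetical helper: rearranges the inequality to
    n - (9k - 11) > sqrt(9k^2 - 22k + 17) and squares both sides,
    which is valid whenever the left-hand side is positive.
    """
    gap = n - (9 * k - 11)
    disc = 9 * k * k - 22 * k + 17  # strictly positive for every integer k
    return gap > 0 and gap * gap > disc


def smallest_n(k: int) -> int:
    """Smallest positive integer n satisfying the condition for a given k."""
    n = 1
    while not bound_holds(n, k):
        n += 1
    return n


if __name__ == "__main__":
    # Print the smallest admissible n next to the simpler sufficient value 12k.
    for k in range(1, 8):
        print(k, smallest_n(k), 12 * k)
```

For example, `bound_holds(10, 2)` is false and `bound_holds(11, 2)` is true, so for \( k = 2 \) the condition first holds at \( n = 11 \). The sufficient condition \( n \geq 12k \) from the abstract is simpler but more conservative than the exact bound.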