Abstract: Disentangling evolutionary signal from noise in genomic datasets is essential to building phylogenies. The efficiency of current sequencing platforms and workflows has produced a plethora of large-scale phylogenomic datasets in which weak phylogenetic signal can easily be overwhelmed by non-phylogenetic signal and noise. However, the nature of the latter is not well understood. Although certain factors have been investigated and verified as affecting the accuracy of phylogenetic reconstructions, many others, as well as interactions among different factors, remain understudied. Here we use a large simulation-based dataset and machine learning to better understand the factors, and their interactions, that contribute to species tree error. We trained Random Forest regression models on features extracted from alignments simulated under known phylogenies to predict the phylogenetic utility of the loci. Loci with the worst predicted utility were then filtered out, improving the signal-to-noise ratio across the dataset. We investigated the relative importance of the different features used by the model, as well as how they correspond to the originally simulated properties. We further applied the model to several diverse empirical datasets to predict the least reliable loci, subset the data accordingly, and re-infer the phylogenies. We measured the impact of subsetting on the overall topologies, on difficult nodes identified in the original studies, and on branch length distributions. Our results suggest that subsetting based on the utility predicted by the model can improve the topological accuracy of the trees and their average statistical support, and can limit paralogy and its effects. Although the topology generated from the filtered datasets may not always be dramatically different from that generated from unfiltered data, the worst loci consistently yielded different topologies and worse statistical support, indicating that our protocol identified phylogenetic noise in the empirical data.
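
The filtering step described above can be illustrated with a minimal sketch. It assumes each locus already has a predicted utility score from some trained regressor (such as a Random Forest model). The function name `split_by_predicted_utility`, the `drop_fraction` parameter, the locus names, the scores, and the convention that higher scores mean better utility are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def split_by_predicted_utility(
    predicted: Dict[str, float],
    drop_fraction: float = 0.25,
    higher_is_better: bool = True,
) -> Tuple[List[str], List[str]]:
    """Split loci into (kept, dropped) by removing the worst-scoring fraction.

    `predicted` maps a locus name to its predicted utility score.
    The scoring direction is an assumption controlled by `higher_is_better`.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    # Rank loci from worst to best predicted utility.
    ranked = sorted(
        predicted,
        key=lambda locus: predicted[locus],
        reverse=not higher_is_better,
    )

    # Drop the worst-ranked fraction (rounded down) and keep the rest.
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:], ranked[:n_drop]


# Hypothetical predicted scores for four loci (illustrative values only).
scores = {"locus_A": 0.91, "locus_B": 0.12, "locus_C": 0.55, "locus_D": 0.78}
kept, dropped = split_by_predicted_utility(scores, drop_fraction=0.25)
print("kept:", kept)
print("dropped:", dropped)
```

In a workflow of this kind, the kept loci would be passed on to species tree inference, and the dropped set could be analyzed separately to check whether it yields different topologies or weaker support.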