Abstract: The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and, in conjunction with recent advances in ancient DNA sequencing technology, the study of human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each tree describes the evolutionary history of a consecutive region of the genome. The current state-of-the-art data structure for this purpose, tree sequences, compresses these genealogical trees by storing the edit operations needed to move from one tree to the next along the genome, instead of storing the full, often redundant, description of each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG, identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straightforward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson's f, the Fixation Index, Tajima's D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster.
In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straightforward memoization of intermediate results. This yields a substantial runtime reduction and a potentially more intuitive data representation compared to the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.
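
The idea of encoding shared subtrees only once and memoizing per-node results can be illustrated with a small Python sketch. This is a toy illustration under stated assumptions, not gfkit's implementation or API: the `Forest` class, the nested-tuple tree encoding, and the `samples_below` function are all hypothetical names introduced here. Each distinct subtree is interned under a canonical key, so a subtree occurring in several trees maps to a single DAG node, and a per-node quantity (here, the number of selected samples below a node) is computed once per node rather than once per tree.

```python
class Forest:
    """Toy DAG storing each distinct subtree once (hypothetical, not gfkit)."""

    def __init__(self):
        self._ids = {}         # canonical subtree key -> node id
        self.children = []     # node id -> tuple of child node ids
        self.leaf_sample = []  # node id -> sample id for leaves, else None

    def _intern(self, key, children, sample):
        # Create a node only if this exact subtree has not been seen before.
        if key not in self._ids:
            self._ids[key] = len(self.children)
            self.children.append(children)
            self.leaf_sample.append(sample)
        return self._ids[key]

    def add_tree(self, tree):
        """Insert a tree given as a sample id (int) or a tuple of subtrees.

        Returns the id of the DAG node representing the tree's root.
        """
        if isinstance(tree, int):
            return self._intern(("leaf", tree), (), tree)
        # Sorting child ids treats subtrees as unordered.
        kids = tuple(sorted(self.add_tree(sub) for sub in tree))
        return self._intern(("inner", kids), kids, None)


def samples_below(forest, sample_set):
    """For every DAG node, count the samples of sample_set that descend from it.

    The counts dictionary acts as the memo table: a shared subtree is
    evaluated once, no matter how many trees contain it.
    """
    counts = {}

    def visit(node):
        if node not in counts:
            sample = forest.leaf_sample[node]
            if sample is not None:
                counts[node] = int(sample in sample_set)
            else:
                counts[node] = sum(visit(c) for c in forest.children[node])
        return counts[node]

    for node in range(len(forest.children)):
        visit(node)
    return counts


# Two trees over samples 0..3 that share the subtree (0, 1).
forest = Forest()
root_a = forest.add_tree(((0, 1), (2, 3)))
root_b = forest.add_tree((((0, 1), 2), 3))
counts = samples_below(forest, {0, 2})
```

Stored separately, the two example trees would need 14 nodes in total; the DAG holds 9, because the four leaves and the subtree `(0, 1)` are shared. Per-node counts of this kind are the sort of intermediate result that statistics based on sample frequencies build on, which is why computing them once per shared subtree can reduce work.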

Calculating and interpreting FST in the genomics era

FSTest: an efficient tool for cross-population fixation index estimation on variant call format files

Fast and accurate joint inference of coancestry parameters for populations and/or individuals

BlockFeST: Bayesian calculation of region-specific FST to detect local adaptation

Estimating hierarchical F-statistics from Pool-Seq data

Error rates in QST–FST comparisons depend on genetic architecture and estimation procedures

An explanation for the sister repulsion phenomenon in Patterson’s f-statistics

Inferring drift, genetic differentiation, and admixture graphs from low-depth sequencing data

FST between haploids and diploids in species with discrete ploidy phases

Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering

Distangsd: Fast and Accurate Inference of Genetic Distances for Next-Generation Sequencing Data

Coalescent-based species tree estimation: a stochastic Farris transform

Fine Population Structure Analysis Method for Genomes of Many

A Novel Measure of Genetic-Distance for Highly Polymorphic Tandem Repeat Loci

DivStat: A User-Friendly Tool for Single Nucleotide Polymorphism Analysis of Genomic Diversity

Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data

grenedalf: population genetic statistics for the next generation of pool sequencing

Memoization on Shared Subtrees Accelerates Computations on Genealogical Forests

Effective number of different populations: A new concept and how to use it