Decoding coalescent hidden Markov models in linear time

Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song

DOI: https://doi.org/10.48550/arXiv.1403.0858

2014-03-05

Abstract: In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.

Populations and Evolution

What problem does this paper attempt to address?

The main problem this paper addresses is the running time of coalescent hidden Markov models (cHMMs) in computational biology. cHMMs are used to infer population history parameters such as population size changes, migration rates, divergence times, and mutation and recombination rates, but their runtime can quickly become prohibitive as more loci, sequences, and hidden states are added to the model. The paper proposes a new algorithm that reduces the runtime from quadratic in the number of hidden time states to linear, that is, from \(O(d^2)\) to \(O(d)\), where \(d\) is the number of discretized time intervals, without making any additional approximations. Because the cost per hidden state drops, the same computational budget allows a finer time discretization, which helps capture complex population histories, especially recent changes. The paper also shows how to incorporate the algorithm into the existing demographic inference method diCal, and demonstrates that the linear-time method can reconstruct a history of population size changes more accurately than the quadratic-time method given similar computational resources. Finally, the authors apply the method to data from the 1000 Genomes Project to infer a high-resolution history of size changes in the European population.
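To make the \(O(d^2)\) versus \(O(d)\) distinction concrete, here is a minimal sketch of one forward-algorithm transition step. It is not the paper's algorithm: it assumes a hypothetical transition matrix whose entries factor as `u[i] * v[j]` above the diagonal and `p[i] * q[j]` below it, which is one simple kind of structure that lets the step be computed with running sums instead of a full matrix product. All function and parameter names are illustrative, and emission probabilities are omitted.

```python
def forward_step_quadratic(alpha, A):
    """Generic transition step: out[j] = sum_i alpha[i] * A[i][j]. Costs O(d^2)."""
    d = len(alpha)
    return [sum(alpha[i] * A[i][j] for i in range(d)) for j in range(d)]


def forward_step_linear(alpha, diag, u, v, p, q):
    """Same result in O(d), assuming (hypothetically) that
    A[i][j] = u[i] * v[j] for i < j,
    A[i][i] = diag[i],
    A[i][j] = p[i] * q[j] for i > j.
    """
    d = len(alpha)
    # Diagonal contributions.
    out = [alpha[j] * diag[j] for j in range(d)]

    # Contributions from states i < j: v[j] * sum_{i<j} alpha[i] * u[i].
    running = 0.0
    for j in range(d):
        out[j] += v[j] * running
        running += alpha[j] * u[j]

    # Contributions from states i > j: q[j] * sum_{i>j} alpha[i] * p[i].
    running = 0.0
    for j in range(d - 1, -1, -1):
        out[j] += q[j] * running
        running += alpha[j] * p[j]
    return out


def build_dense(diag, u, v, p, q):
    """Materialize the structured matrix, only to check the fast step."""
    d = len(diag)
    return [
        [diag[i] if i == j else (u[i] * v[j] if i < j else p[i] * q[j])
         for j in range(d)]
        for i in range(d)
    ]
```

Calling `forward_step_quadratic(alpha, build_dense(diag, u, v, p, q))` and `forward_step_linear(alpha, diag, u, v, p, q)` on the same inputs should agree up to floating-point rounding; the second never builds the \(d \times d\) matrix, so its cost grows linearly in \(d\).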
