Decoding coalescent hidden Markov models in linear time

Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song
DOI: https://doi.org/10.48550/arXiv.1403.0858
2014-03-05
Abstract: In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.
Populations and Evolution
What problem does this paper attempt to address?
The paper addresses the computational cost of coalescent hidden Markov models (coalescent HMMs) in computational biology. Its goal is to reduce the runtime of these models from quadratic to linear in the number of hidden time states, without introducing any additional approximations. This lets coalescent HMMs process larger data sets and infer more detailed population histories, including population size changes, migration rates, divergence times, and mutation and recombination rates.

The improvement matters most as more loci, sequences, and hidden states are added to the model, because the runtime can then quickly become prohibitive. By reducing the runtime complexity from \(O(d^2)\) to \(O(d)\), where \(d\) is the number of discretized time intervals, the new algorithm makes a finer time discretization affordable, which helps capture complex population histories, especially recent changes. The paper implements the algorithm in the existing demographic inference method diCal and demonstrates that the linear-time method reconstructs a population size change history more accurately than the quadratic-time method, given similar computational resources. Finally, the authors apply the method to data from the 1000 Genomes Project to infer a high-resolution history of size changes in the European population.
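
To make the \(O(d^2)\) versus \(O(d)\) distinction concrete, here is a minimal Python sketch of one HMM forward step computed two ways. It is not the paper's algorithm and not diCal or PSMC code. It assumes a hypothetical transition matrix whose entries above the diagonal factor as `u[i] * v[j]` and whose entries below the diagonal factor as `w[i] * x[j]`. This structure was chosen only because it lets running sums replace the dense matrix-vector product. All names (`dense_transition`, `forward_step_quadratic`, `forward_step_linear`, `u`, `v`, `w`, `x`, `diag`) are illustrative.

```python
import numpy as np


def dense_transition(u, v, w, x, diag):
    """Build the full d-by-d matrix for the ASSUMED toy structure:
    T[i, j] = u[i] * v[j] if i < j, w[i] * x[j] if i > j, diag[i] if i == j."""
    upper = np.triu(np.outer(u, v), k=1)
    lower = np.tril(np.outer(w, x), k=-1)
    return upper + lower + np.diag(diag)


def forward_step_quadratic(f, T, e):
    """Standard forward update: O(d^2) work for the vector-matrix product."""
    return e * (f @ T)


def forward_step_linear(f, u, v, w, x, diag, e):
    """Same update in O(d) work, using prefix and suffix sums
    instead of ever forming the dense matrix."""
    fu = f * u
    fw = f * w
    # below[j] = sum of f[i] * u[i] over i < j (exclusive prefix sum)
    below = np.concatenate(([0.0], np.cumsum(fu)[:-1]))
    # above[j] = sum of f[i] * w[i] over i > j (exclusive suffix sum)
    above = np.concatenate((np.cumsum(fw[::-1])[::-1][1:], [0.0]))
    return e * (v * below + x * above + diag * f)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8  # number of discretized time intervals (hidden states)
    u, v, w, x, diag, e, f = (rng.random(d) for _ in range(7))
    T = dense_transition(u, v, w, x, diag)
    slow = forward_step_quadratic(f, T, e)
    fast = forward_step_linear(f, u, v, w, x, diag, e)
    print(np.allclose(slow, fast))
```

Both functions return the same vector, but the linear version never touches all \(d^2\) matrix entries. The summary above does not describe how the paper achieves linear time for real coalescent transition matrices, so this sketch should be read only as an illustration of why exploiting structure in the transitions can remove a factor of \(d\).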