Rate variation and recurrent sequence errors in pandemic-scale phylogenetics

Nicola De Maio, Myrthe Willemsen, Zihao Guo, Abhratanu Saha, Martin Hunt, Nhan Ly-Trong, Bui Quang Minh, Zamin Iqbal, Nick Goldman
DOI: https://doi.org/10.1101/2024.07.12.603240
2024-07-15
Abstract: Phylogenetic analyses of genome sequences from infectious pathogens reveal essential information regarding their evolution and transmission, as seen during the COVID-19 pandemic. Recently developed pandemic-scale phylogenetic inference methods reduce the computational demand of phylogenetic reconstruction from genomic epidemiological datasets, allowing the analysis of millions of closely related genomes. However, widespread homoplasies, due to recurrent mutations and sequence errors, cause phylogenetic uncertainty and biases. We present new algorithms and models to substantially improve the computational performance and accuracy of pandemic-scale phylogenetics. In particular, we account for, and identify, mutation rate variation and recurrent sequence errors. We reconstruct a reliable and public sequence alignment and phylogenetic tree of >2 million SARS-CoV-2 genomes, encapsulating the evolutionary history and global spread of the virus up to February 2023.
Bioinformatics