Intrahost dynamics, together with genetic and phenotypic effects predict the success of viral mutations

Cedric C.S. Tan, Marina Escalera-Zamudio, Alexei Yavlinsky, Lucy van Dorp
DOI: https://doi.org/10.1101/2024.10.18.619070
2024-10-19
Abstract: Predicting the fitness of mutations in the evolution of pathogens is a long-standing and important, yet largely unsolved problem. In this study, we used SARS-CoV-2 as a model system to explore whether the intrahost diversity of viral infections could provide clues on the relative fitness of single amino acid variants (SAVs). To do so, we analysed ~15 million complete genomes and nearly 8000 sequencing libraries generated from SARS-CoV-2 infections, which were collected at various timepoints during the COVID-19 pandemic. Across timepoints, we found that many successful SAVs were detected in the intrahost diversity of samples collected earlier, with a median of 6-40 months between the initial collection dates of samples and the highest frequency seen for these SAVs. Additionally, we found that the co-occurrence of intrahost SAVs significantly captures genetic linkage patterns observed at the interhost level (Pearson's r=0.28-0.45, all p<0.0001). Further, we show that machine learning models can learn highly generalisable intrahost, physicochemical and phenotypic patterns to forecast the future fitness of intrahost SAVs (r²=0.48-0.63). Most of these models performed significantly better when considering genetic linkage (r²=0.53-0.68). Overall, our results document the evolutionary forces shaping the fitness of mutations, which may offer potential to forecast the emergence of future variants and ultimately inform the design of vaccine targets.
Genomics
What problem does this paper attempt to address?
This paper addresses the problem of predicting the fitness of mutations during pathogen evolution, specifically the success of single amino acid variants (SAVs) of SARS-CoV-2. The researchers used SARS-CoV-2 as a model system to explore whether the intrahost diversity of viral infections could provide clues about the relative fitness of SAVs. They analysed approximately 15 million complete genomes and nearly 8,000 sequencing libraries collected at different time points during the COVID-19 pandemic. Many successful SAVs were detected in the intrahost diversity of samples collected earlier, with a median of 6-40 months between the initial collection dates of those samples and the highest frequency seen for these SAVs. The study also found that the co-occurrence of intrahost SAVs significantly captured the genetic linkage patterns observed at the interhost level (Pearson’s \(r = 0.28\text{-}0.45\), all \(p < 0.0001\)). Machine learning models could learn highly generalisable intrahost, physicochemical, and phenotypic patterns to forecast the future fitness of intrahost SAVs (\(r^2 = 0.48\text{-}0.63\)), and most of these models performed significantly better when genetic linkage was considered (\(r^2 = 0.53\text{-}0.68\)). Overall, the study documents the evolutionary forces that shape mutation fitness, which may help forecast the emergence of future variants and ultimately inform the design of vaccine targets.
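For readers less familiar with the two statistics reported above, the following is a minimal sketch of how Pearson's r and an r² score can be computed. It is not the authors' pipeline. The function names, the variable names, and the toy numbers are all hypothetical, and the sketch assumes r² means the coefficient of determination (1 minus residual over total sum of squares), which the abstract does not specify.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("r^2 is undefined when observed values are constant")
    return 1 - ss_res / ss_tot


# Made-up toy values, for illustration only (not data from the paper):
# a co-occurrence score for pairs of intrahost SAVs versus a linkage
# score for the same pairs at the interhost level.
intrahost_cooccurrence = [0.10, 0.40, 0.20, 0.80, 0.50]
interhost_linkage = [0.20, 0.30, 0.10, 0.90, 0.60]
print("Pearson r:", round(pearson_r(intrahost_cooccurrence, interhost_linkage), 3))

# Made-up observed versus model-predicted fitness values for five SAVs.
observed_fitness = [0.05, 0.30, 0.55, 0.70, 0.90]
predicted_fitness = [0.10, 0.25, 0.50, 0.80, 0.85]
print("r^2:", round(r_squared(observed_fitness, predicted_fitness), 3))
```

Under this definition, an r² of 0 corresponds to a model that does no better than predicting the mean observed fitness, and 1 corresponds to perfect prediction.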