ABSTRACT We conducted an in silico analysis to better understand the factors that may affect host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink, three hosts for which there is strong evidence of sustained transmission. Classification models trained on single nucleotide and amino acid differences between samples effectively distinguished white-tailed deer-, human-, and mink-derived SARS-CoV-2. For example, the balanced accuracy score of Extremely Randomized Trees classifiers was 0.984 ± 0.006. Eighty-eight commonly identified predictive mutations are found at sites under strong positive and negative selective pressure. A large fraction of sites under selection (86.9%) or identified by machine learning (87.1%) are found in genes other than the spike. Some locations encoded by these gene regions are predicted to be B- and T-cell epitopes or are implicated in modulating the immune response, suggesting that host adaptation may involve evasion of the host immune system, modulation of the class-I major-histocompatibility complex, and diminished recognition of immune epitopes by CD8+ T cells. Our selection and machine learning analyses also indicated that silent mutations, such as C7303T and C9430T, play an important role in discriminating deer-derived samples across multiple clades. Finally, our investigation into the origin of the B.1.641 lineage from white-tailed deer in Canada discovered an additional human sequence from Michigan that is related to the B.1.641 lineage and was sampled near the time of its emergence. These findings demonstrate that machine-learning approaches can be used in combination with evolutionary genomics to identify factors possibly involved in the cross-species transmission of viruses and the emergence of novel viral lineages.
IMPORTANCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible virus capable of infecting and establishing itself in human and wildlife populations, such as white-tailed deer. This highlights the importance of developing novel ways to identify genetic factors that contribute to its spread and adaptation to new host species, especially since these populations can serve as reservoirs that potentially facilitate the re-introduction of new variants into human populations. In this study, we apply machine learning and phylogenetic methods to uncover biomarkers of SARS-CoV-2 adaptation in mink and white-tailed deer. We find evidence that both non-synonymous and silent mutations can be used to differentiate animal-derived sequences from human-derived ones and from each other. This evidence also suggests that host adaptation involves the evasion of the immune system and the suppression of antigen presentation. Finally, the methods developed here are general and can be used to investigate host adaptation in viruses other than SARS-CoV-2.
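
The classifier performance quoted in the abstract is a balanced accuracy score. As a minimal sketch, assuming the standard definition (the unweighted mean of per-class recall), the snippet below shows why this metric suits host classification when human-derived samples greatly outnumber deer- and mink-derived ones. The labels and predictions are hypothetical toy values, not data from the study, and the function `balanced_accuracy` is an illustrative helper rather than the authors' code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall (assumed standard definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy data: 8 human, 2 deer, 2 mink samples.
y_true = ["human"] * 8 + ["deer"] * 2 + ["mink"] * 2
# One deer-derived sample is misclassified as human; everything else is correct.
y_pred = ["human"] * 8 + ["deer", "human"] + ["mink", "mink"]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"balanced accuracy: {score:.3f}")
print(f"plain accuracy:    {plain:.3f}")
```

In this toy case a single misclassified deer sample lowers the balanced accuracy to about 0.833, while plain accuracy stays at about 0.917, because each host class contributes equally to the balanced score regardless of how many samples it has.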
