Phylogenetic-based methods for fine-scale classification of PRRSV-2 ORF5 sequences: a comparison of their robustness and reproducibility

Kimberly VanderWaal, Nakarin Pamornchainvakul, Mariana Kikuti, Daniel Linhares, Giovani Trevisan, Jianqiang Zhang, Tavis K Anderson, Michael Zeller, Stephanie Rossow, Derald J Holtkamp, Dennis N Makau, Cesar A Corzo, Igor A.D. Paploski
DOI: https://doi.org/10.1101/2024.05.13.593920
2024-05-15
Abstract: Disease management and epidemiological investigations of porcine reproductive and respiratory syndrome virus-type 2 (PRRSV-2) often rely on grouping together highly related sequences. In the USA, the last five years have seen a major shift in how the swine industry classifies PRRSV-2, moving away from restriction fragment length polymorphism (RFLP) typing and toward phylogenetic lineage-based classification. However, lineages and sub-lineages are large and genetically diverse, and the rapid mutation rate of PRRSV, coupled with the global prevalence of the disease, has made it challenging to identify new and emerging variants. Thus, within the lineage system, a dynamic fine-scale classification scheme is needed to provide better resolution on the relatedness of PRRSV-2 viruses, to inform disease management and monitoring efforts, and to facilitate research and communication about circulating viruses. Here, we compare potential fine-scale systems for classifying PRRSV-2 variants (i.e., genetic clusters of closely related ORF5 sequences at finer scales than sub-lineage) using a database of 28,730 sequences from 2010 to 2021, representing >55% of the U.S. pig population. In total, we compared 140 approaches that differed in their tree-building method, criteria, and thresholds for defining variants within phylogenetic trees using TreeCluster. Three approaches produced epidemiologically meaningful variants (i.e., ≥5 sequences per cluster) and yielded reproducible and robust outputs even when the input data or input phylogenies were changed. In the three best-performing approaches, the average genetic distance among sequences belonging to the same variant was 2.1-2.5%, and the genetic divergence between variants was 2.5-2.7%.
Machine learning classification algorithms were also trained to assign new sequences to an existing variant with >95% accuracy, which shows that newly generated sequences could be assigned without repeating the phylogenetic and clustering analyses. Finally, we identified 73 sequence-clusters (dated <1 year apart with close phylogenetic relatedness) associated with circulation events on single farms. The percentage of farm sequence-clusters with an ID change was 6.5-8.7% for our best approaches. In contrast, ~43% of farm sequence-clusters had variation in their RFLP-type, further demonstrating how our proposed fine-scale classification system addresses shortcomings of RFLP-typing. By identifying robust and reproducible classification approaches for PRRSV-2, this work lays the foundation for a fine-scale system that would more reliably group related field viruses and provide improved clarity for decision-making surrounding disease management.
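As an illustration of the within-variant distance figures quoted in the abstract, the following is a minimal sketch of how a mean pairwise genetic distance per variant could be computed. It assumes a simple p-distance (proportion of differing sites) on pre-aligned sequences; the abstract does not state which distance measure the authors used, and the function names and toy sequences here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_within_variant_distance(
    seqs: Dict[str, str], labels: Dict[str, str]
) -> Dict[str, float]:
    """Average pairwise p-distance among sequences sharing a variant label.

    Variants with fewer than two sequences have no pairs and are omitted.
    """
    groups = defaultdict(list)
    for seq_id, variant in labels.items():
        groups[variant].append(seqs[seq_id])

    result = {}
    for variant, members in groups.items():
        pairs = list(combinations(members, 2))
        if pairs:
            result[variant] = sum(p_distance(a, b) for a, b in pairs) / len(pairs)
    return result


# Toy, made-up data (real ORF5 sequences are far longer).
toy_seqs = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT", "s4": "CCCC"}
toy_labels = {"s1": "v1", "s2": "v1", "s3": "v1", "s4": "v2"}
within = mean_within_variant_distance(toy_seqs, toy_labels)
```

With real data, `within` would hold one average per variant, which could then be compared against the 2.1-2.5% range reported for the best-performing approaches.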
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how Porcine Reproductive and Respiratory Syndrome Virus Type 2 (PRRSV-2) is classified in molecular epidemiological surveillance, with a focus on fine-scale classification. Its research objectives can be summarized as follows:

1. **Improve the current classification system**: Restriction Fragment Length Polymorphism (RFLP) typing has limitations and does not accurately reflect the genetic relationships between viruses. Lineage-based classification is an improvement, but lineages and sub-lineages are large and genetically diverse, and the rapid mutation rate and global prevalence of PRRSV make new and emerging variants hard to identify.
2. **Develop a robust and reliable fine-scale classification scheme**: A dynamic fine-scale scheme is needed to give better resolution on the relatedness of PRRSV-2 viruses, to support disease management and monitoring, and to facilitate research and communication about circulating viruses.
3. **Evaluate the robustness and reproducibility of different methods**: By comparing 140 phylogenetic clustering approaches, the study aims to identify those that produce epidemiologically meaningful variants (i.e., each cluster contains at least 5 sequences) and give consistent results even when the input data or input phylogenies change.
4. **Establish machine learning models for sequence classification**: To simplify the classification of new sequences, the study trained machine learning algorithms that assign newly generated sequences to existing variants with high accuracy (>95%), without repeating the phylogenetic and clustering analyses.
5. **Analyze variant circulation events at the farm level**: The study identified 73 sequence-clusters associated with circulation events on single farms and used them to check how often the assigned variant ID changes within a cluster (6.5-8.7% for the best approaches, versus ~43% with variation in RFLP-type).
In summary, the core objective of this study is to provide a more reliable and detailed classification system for PRRSV-2 by comparing and evaluating different phylogenetic clustering methods, to support disease management decisions and scientific research.
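The farm-level check described above can be sketched as a simple proportion. This is only an illustrative sketch: it assumes that a farm sequence-cluster counts as having an "ID change" when its sequences carry more than one distinct label, which is an interpretation not spelled out in the abstract, and the function name, cluster names, and labels are hypothetical.

```python
from typing import Dict, List


def fraction_with_id_change(cluster_ids: Dict[str, List[str]]) -> float:
    """Share of farm sequence-clusters whose sequences carry more than one label.

    `cluster_ids` maps a farm sequence-cluster name to the labels (variant IDs
    or RFLP-types) assigned to the sequences in that cluster.
    """
    if not cluster_ids:
        raise ValueError("no farm sequence-clusters supplied")
    changed = sum(1 for ids in cluster_ids.values() if len(set(ids)) > 1)
    return changed / len(cluster_ids)


# Toy, made-up labels for three farm sequence-clusters.
toy_clusters = {
    "farm_A": ["v12", "v12", "v12"],
    "farm_B": ["v03", "v07"],  # two distinct IDs, so this counts as a change
    "farm_C": ["v05", "v05"],
}
rate = fraction_with_id_change(toy_clusters)
```

Running the same function once on variant IDs and once on RFLP-types for the same clusters would give the kind of side-by-side comparison the abstract reports.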