Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses

Amine M. Remita, Abdoulaye Baniré Diallo
DOI: https://doi.org/10.1109/BIBM47256.2019.8983375
2024-05-29
Abstract: Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in samples from environments. These methods face several challenges associated with the nature and properties of viral genomes such as recombination, mutation rate and diversity. Also, new generations of sequencing technologies raise other difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, there is a lack of exploration of the accuracy space of existing models in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given precise combinations of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.
Machine Learning, Genomics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address several key issues in viral genome classification, particularly in the genotyping and subtyping of Hepatitis C Virus (HCV) without sequence alignment. Specifically, the paper focuses on the following points:

1. **Challenges in Viral Genome Classification**:
   - Characteristics of viral genomes, such as recombination, mutation rate, and diversity, pose difficulties for classification.
   - Next-generation sequencing technologies generate a large number of fragmented sequences, further complicating classification.
2. **Evaluation of Linear Classifiers**:
   - Assessing the performance of different types of linear classifiers (generative and discriminative) in genotyping and subtyping tasks.
   - Considering different hyperparameter settings, such as smoothing values and regularization penalty functions, as well as k-mer words of different lengths.
3. **Classification of Partial and Complete Genomes**:
   - Investigating the performance of classifiers trained on complete genomes when classifying partial genome fragments.
   - Exploring whether model parameters can be estimated using k-mer counts from complete genomes to correctly classify genome fragments without explicitly sampling fragments in the training set.
4. **Benchmark Datasets and Evaluation Methods**:
   - Providing detailed evaluation procedures and benchmark datasets to enable other researchers to more robustly assess methods for classification from viral genomes.

### Summary

By systematically evaluating the performance of different types of linear classifiers in HCV genome classification, the paper aims to provide an efficient and reliable method for viral genome classification. The research findings not only help to understand the strengths and limitations of different classifiers in handling viral genome data but also offer valuable references for future related studies.
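
To make the alignment-free setup more concrete, here is a minimal sketch of one way a generative linear classifier could be trained on k-mer counts from full-length sequences and then applied to a short fragment. It uses a multinomial naive Bayes model with additive smoothing as an illustrative choice; the summary above does not confirm that this is the exact model or implementation used in the paper. The class name `KmerMultinomialNB`, the labels `g1` and `g2`, and the toy sequences are all hypothetical and are not HCV data.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers; words with non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


class KmerMultinomialNB:
    """Hypothetical multinomial naive Bayes over k-mer counts (illustration only)."""

    def __init__(self, k=3, alpha=1.0):
        # alpha is the additive smoothing value; it must be > 0 here,
        # otherwise unseen k-mers would lead to log(0).
        self.k = k
        self.alpha = alpha
        self.vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]

    def fit(self, sequences, labels):
        # Pool k-mer counts of all training sequences per class.
        pooled = {}
        for seq, label in zip(sequences, labels):
            pooled.setdefault(label, Counter()).update(kmer_counts(seq, self.k))
        label_freq = Counter(labels)
        self.log_prior = {c: math.log(label_freq[c] / len(labels)) for c in pooled}
        self.log_lik = {}
        for c, counts in pooled.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            self.log_lik[c] = {
                w: math.log((counts[w] + self.alpha) / denom) for w in self.vocab
            }
        return self

    def predict(self, seq):
        # The class score is linear in the k-mer count vector of the query.
        counts = kmer_counts(seq, self.k)
        scores = {
            c: self.log_prior[c]
            + sum(n * self.log_lik[c][w] for w, n in counts.items())
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy "complete" sequences (synthetic, NOT real HCV genomes).
train_seqs = ["AAATAAATAAACAAATAAAG" * 3, "GGCGGGCCGGCGCCGGGCGC" * 3]
train_labels = ["g1", "g2"]

clf = KmerMultinomialNB(k=3, alpha=1.0).fit(train_seqs, train_labels)

# Classify short fragments without having sampled fragments for training.
print(clf.predict("AATAAAC"))
print(clf.predict("GCCGGG"))
```

In this sketch, `k` and `alpha` play the role of the k-mer length and smoothing value named among the experimental variables. A discriminative counterpart would replace the closed-form count estimates with weights fitted under a regularization penalty.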