Classification of HCV Infections Through Sequence Image Normalization

Sunitha Basodi, Pelin Burcak Icer, Pavel Skums, Yury Khudyakov, Alexander Zelikovsky, Yi Pan
DOI: https://doi.org/10.1109/ICCABS.2017.8114313
2017-01-01
Abstract: Identifying Hepatitis C virus (HCV) infections is crucial for detecting viral outbreaks. Because HCV is highly mutable, infections tend to become chronic over time. This leads to a growing heterogeneous population of genetically related HCV variants in affected individuals. To our knowledge, there are no reliable diagnostic assays for distinguishing acute from chronic HCV infections. A robust classification scheme for staging viral infection requires identifying prominent features, which in this case can be done using domain knowledge. Simple genetic heterogeneity metrics do not represent HCV infections accurately enough to serve as features for classification algorithms. This is due to the complex structural development of intra-host populations, which are affected by bouts of selective sweeps and negative selection during chronic infection [1], [2]. Although some machine learning models are known to work well on sequence data for classification problems, applying them directly to viral genomic data is problematic, since the number of viral sequences and the structure of intra-host viral populations are not consistent across samples. We propose a novel preprocessing approach that transforms irregular viral genomic data into normalized image data. This representation makes it possible to apply powerful machine learning algorithms to the problem of classifying recent and chronic HCV infections. Our dataset consists of intra-host HCV populations of the highly heterogeneous genomic region HVR1, collected from 108 recently and 257 chronically infected individuals and sampled by next-generation sequencing. We train several classification models on the transformed image data using stratified 10-fold cross-validation. The SVM classification model achieves the highest accuracy, 98%, with precision, recall, and F1 score all above 95% for both acutely and chronically HCV-infected individuals.
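As an illustration of the evaluation setup only, the following minimal sketch shows how a stratified 10-fold split would distribute the 108 recent and 257 chronic samples so that each fold keeps roughly the same class ratio. The function name `stratified_k_fold`, the round-robin assignment, and the random seed are assumptions made for this example; the abstract does not describe the authors' implementation, and the image transformation and SVM training are not reproduced here.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k lists of sample indices, each with a near-equal share of every class.

    Hypothetical helper for illustration; not the authors' code.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so the class is spread evenly.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class sizes taken from the abstract: 108 recent, 257 chronic.
labels = ["recent"] * 108 + ["chronic"] * 257
folds = stratified_k_fold(labels, k=10)

for i, test_idx in enumerate(folds):
    n_recent = sum(labels[j] == "recent" for j in test_idx)
    n_chronic = len(test_idx) - n_recent
    print(f"fold {i}: {len(test_idx)} test samples ({n_recent} recent, {n_chronic} chronic)")
```

In each round, one fold serves as the test set and the other nine as the training set. With these class sizes, every fold holds 10 or 11 recent and 25 or 26 chronic samples, so the minority class is represented in every test set.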