Nucleotide Amino Acid K-Mer Vector: an Alignment-Free Method for Comparing Genomic Sequences

Xiaona Bao, Lily He, Jingan Cui, Stephen S-T Yau
DOI: https://doi.org/10.4310/cis.2022.v22.n3.a2
2022-01-01
Communications in Information and Systems
Abstract: Evolutionary analysis of genomic data is an important topic in bioinformatics, and a great deal of DNA data has become available. In evolutionary analysis, protein sequences are more meaningful than DNA sequences, and alignment-free methods based on k-mers are widely used. However, the dimension of a k-mer vector built from a protein sequence is very high. This paper proposes a new Nucleotide Amino Acid K-mer Vector (NAAKV) technique, which converts a DNA sequence into a pseudo amino acid sequence (PAAS). This transformation does not require locating the coding regions of the gene sequence, yet it still reflects changes at the nucleotide level. Moreover, the amino acids in a PAAS are strongly correlated, so far fewer distinct k-mers occur than in a protein sequence, and the dimension of the vector is greatly reduced. To test NAAKV, we carry out phylogenetic analyses of several viruses and bacteria, comparing against the traditional k-mer method and the alignment-based MUSCLE method on each dataset. The results suggest that NAAKV is accurate and time-efficient for phylogenetic analysis and genome classification.
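For orientation, the traditional k-mer baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the NAAKV method itself: the abstract does not specify how the PAAS is constructed, so that step is omitted. The function names `kmer_vector` and `euclidean`, the frequency normalization, and the choice of Euclidean distance are assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math

DNA = "ACGT"  # nucleotide alphabet; a protein alphabet would have 20 letters


def kmer_vector(seq, k, alphabet=DNA):
    """Return the normalized k-mer frequency vector of seq.

    The vector has len(alphabet)**k entries, one per possible k-mer,
    in lexicographic order of the alphabet. Windows containing symbols
    outside the alphabet (for example 'N') are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, used only to show the call pattern.
v1 = kmer_vector("ACGTACGTAC", 2)
v2 = kmer_vector("ACGTTTGTAC", 2)
distance = euclidean(v1, v2)
```

The vector length grows as the alphabet size raised to the power k: 4**3 = 64 entries for DNA 3-mers versus 20**3 = 8000 for protein 3-mers, which is the dimensionality problem the abstract refers to.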