Fast and Accurate Genome Comparison Using Genome Images: the Extended Natural Vector Method.

Shaojun Pei, Wenhui Dong, Xiuqiong Chen, Rong Lucy He, Stephen S.-T. Yau
DOI: https://doi.org/10.1016/j.ympev.2019.106633
IF: 5.019
2019-01-01
Molecular Phylogenetics and Evolution
Abstract: Numerical methods for genome comparison have long been important in bioinformatics. The Chaos Game Representation (CGR) is an effective technique for mapping genome sequences to images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV), which is based on the distribution of intensity values in the image. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets, including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes, to build their phylogenetic trees. Results show that our method combining CGR and ENV (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.
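The pipeline described in the abstract (sequence to CGR image, image to vector, vector distance) can be illustrated with a minimal Python sketch. The abstract does not give the exact ENV formula, so the descriptor below is a simplified stand-in: for each intensity level it records the pixel count and the mean pixel position. The corner assignment, the grid resolution `k`, the intensity cap `levels`, the handling of ambiguous bases, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

# Assumed corner assignment for the unit square (a common CGR convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_image(seq, k=3):
    """Return a 2^k x 2^k grid counting CGR points that fall in each cell."""
    n = 2 ** k
    img = [[0] * n for _ in range(n)]
    x, y = 0.5, 0.5  # start at the centre of the square
    for base in seq.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous bases such as N are skipped
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        img[row][col] += 1
    return img


def intensity_descriptor(img, levels=4):
    """Simplified stand-in for the ENV (not the paper's exact definition).

    Intensities are capped at levels - 1 so the vector has a fixed length.
    For each level: pixel count, mean row index, mean column index.
    """
    features = []
    for v in range(levels):
        cells = [
            (r, c)
            for r, row in enumerate(img)
            for c, val in enumerate(row)
            if min(val, levels - 1) == v
        ]
        count = len(cells)
        mean_r = sum(r for r, _ in cells) / count if count else 0.0
        mean_c = sum(c for _, c in cells) / count if count else 0.0
        features.extend([float(count), mean_r, mean_c])
    return features


def sequence_distance(a, b, k=3, levels=4):
    """Euclidean distance between the descriptors of two sequences."""
    va = intensity_descriptor(cgr_image(a, k), levels)
    vb = intensity_descriptor(cgr_image(b, k), levels)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(va, vb)))
```

A pairwise distance matrix built with `sequence_distance` could then be passed to a standard tree-building or clustering routine. Unlike the ENV described in the abstract, this simplified descriptor is not claimed to be in one-to-one correspondence with the image.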