Abstract:

Background: Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences.

Results: We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria; full genome), and P. furiosus (Archaea; full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships.

Conclusion: Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
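The core computation mentioned in the abstract, building a CGR for each sequence and measuring a distance between two CGRs, can be illustrated with a minimal sketch. This is not the authors' implementation: the corner assignment (A, C, G, T on the unit square), the grid resolution `k`, the handling of non-ACGT characters, and the function names `cgr_counts` and `euclidean` are all assumptions made for illustration. Only the Euclidean distance is shown, whereas the paper compares six distances.

```python
import math

# Assumed corner layout (a common convention); the abstract does not specify one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_counts(seq, k=3):
    """Return a 2^k x 2^k grid of CGR point counts for a DNA sequence.

    Each nucleotide moves the current point halfway toward its corner.
    A point is counted only after k consecutive valid nucleotides, so each
    cell corresponds to one k-mer. Non-ACGT characters reset the walk
    (an assumption of this sketch).
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y, run = 0.5, 0.5, 0
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            x, y, run = 0.5, 0.5, 0
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        run += 1
        if run >= k:
            # Clamp guards against floating-point coordinates rounding to 1.0.
            row = min(int(y * size), size - 1)
            col = min(int(x * size), size - 1)
            grid[row][col] += 1
    return grid


def euclidean(a, b):
    """Euclidean distance between two equally sized count grids."""
    return math.sqrt(
        sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    )


if __name__ == "__main__":
    g1 = cgr_counts("ACGTACGTTTGACA", k=2)
    g2 = cgr_counts("GGGCGCGGCCGCGG", k=2)
    print(euclidean(g1, g2))
```

Computing such a distance for every pair of sequences gives a distance matrix, which a dimensionality-reduction step can then embed in two or three dimensions, in the spirit of the Molecular Distance Maps described above. In practice the raw counts would likely be normalized first so that sequences of different lengths are comparable.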

Scaling behaviors of CG clusters in coding and noncoding DNA sequences

Scaling Behaviors of CG Clusters for Chromosomes

Statistical properties and fractals of nucleotide clusters in DNA sequences

Scaling Behavior of Nucleotide Cluster in DNA Sequences

Statistical Properties of Nucleotide Clusters in DNA Sequences

Long-range correlations in DNA sequences using 2D DNA walk based on pairs of sequential nucleotides

Long-Range Correlations In DNA Sequences Using Two-Dimensional DNA Walks

The mathematics of the genetic code reveal that frequency degeneracy leads to exponential scaling in the DNA codon distribution of Homo sapiens

Statistical Properties of Nucleotides in Human Chromosomes 21 and 22

Scaling And Hierarchical Structures In DNA Sequences

Universality and Shannon entropy of codon usage

Violation of the Single-Parameter Scaling Hypothesis in Human Chromosome 22 with Charge Transfer Models

Reconsidering the significance of genomic word frequency

Universal 1/f noise, cross-overs of scaling exponents, and chromosome specific patterns of GC content in DNA sequences of the human genome

Hierarchical Dinucleotide Distribution in Genome along Evolution and Its Effect on Chromatin Packing

Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

An investigation into inter- and intragenomic variations of graphic genomic signatures

Universal power law behaviors in genomic sequences and evolutionary models

The linear correlation between genome size and the size of the non-transcribing region

Long-Tail Feature of DNA Words Over- and Under-Representation in Coding Sequences

Chance and necessity in chromosomal gene distributions