Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences

A. N. Gorban, T. G. Popova, A. Yu. Zinovyev
DOI: https://doi.org/10.48550/arXiv.q-bio/0410033
IF: 4.31
2004-10-27
Genomics
Abstract: Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in bacterial genomic sequences and explained its properties. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C content. Based on the analysis of 143 completely sequenced bacterial genomes available in GenBank in August 2004, we show that four "pure" types of the 7-cluster structure are observed. Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database that is available on our website: http://www.ihes.fr/~zinovyev/7clusters The finding can be readily introduced into any software for gene prediction, sequence alignment or bacterial genome classification.
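The "in-phase" triplet distribution of a short fragment can be illustrated with a minimal sketch. It is not the authors' code: the function names `triplet_frequencies` and `window_vectors`, the default window width of 300 bp (chosen from the 200-400 bp range stated in the abstract) and the step of 30 bp are all assumptions made for illustration. Each fragment is read as non-overlapping triplets starting at its first position, and the 64 relative frequencies form the vector that a clustering or projection method would then work on.

```python
from collections import Counter
from itertools import product

# All 64 triplets over the DNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment, phase=0):
    """Relative frequencies of non-overlapping triplets read from `phase`.

    Triplets containing anything other than A, C, G, T are skipped.
    """
    fragment = fragment.upper()
    triplets = (fragment[i:i + 3] for i in range(phase, len(fragment) - 2, 3))
    counts = Counter(t for t in triplets if t in CODONS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def window_vectors(sequence, width=300, step=30):
    """One 64-dimensional frequency vector per sliding window.

    `width` and `step` are illustrative assumptions, not values from the paper.
    """
    return [
        triplet_frequencies(sequence[start:start + width])
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

For example, `triplet_frequencies("ATGATGCCC")` assigns 2/3 to `ATG` and 1/3 to `CCC`. Applying `window_vectors` to a whole genome yields the cloud of points in which a cluster structure could be looked for.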