A nucleotide composition constraint of genome sequences

Chun-Ting Zhang, Ren Zhang
DOI: https://doi.org/10.1016/j.compbiolchem.2004.02.002
IF: 3.737
2004-01-01
Computational Biology and Chemistry
Abstract: Let a, c, g and t denote the occurrence frequencies of A, C, G and T, respectively, in a genome. We calculated the statistical quantity S = a² + c² + g² + t² for each of 809 genomes (11 archaea, 42 bacteria, 3 eukaryota, 90 phages, 36 viroids and 627 viruses) and 236 plasmids. We found that the strict inequality S < 1/3 holds for almost all of these genomes and plasmids. As direct deductions from this observation, it is shown that (i) the statistical quantity S is a kind of genome order index, which is negatively correlated with the Shannon H function; (ii) S < 1/3 suggests that a minimal value of the Shannon H function is required for each genome; (iii) S as defined above would be a new biological statistical quantity, useful for describing the composition features of genomes; (iv) by jointly considering the Chargaff Parity Rule 2, it is shown that the genomic G+C content should lie between 0.211 and 0.789.
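The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`base_frequencies`, `order_index`, `shannon_h`, `gc_bounds`) are hypothetical, the Shannon H function is assumed to use base-2 logarithms, and Chargaff Parity Rule 2 is assumed to mean a = t and c = g within a single strand. Under that assumption, writing x = g + c gives S = x² - x + 1/2, so S < 1/3 reduces to x² - x + 1/6 < 0, whose roots reproduce the stated G+C bounds.

```python
import math
from collections import Counter


def base_frequencies(seq):
    """Return the occurrence frequencies of A, C, G and T in seq.

    Symbols other than A, C, G, T (e.g. N) are ignored; this is an
    assumption of the sketch, not something the abstract specifies.
    """
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def order_index(freqs):
    """S = a^2 + c^2 + g^2 + t^2."""
    return sum(f * f for f in freqs.values())


def shannon_h(freqs):
    """Shannon H in bits (base-2 logarithm assumed); zero frequencies are skipped."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)


def gc_bounds(s_max=1 / 3):
    """G+C range implied by S < s_max when a = t and c = g.

    With x = g + c, S = x^2 - x + 1/2, so the bounds are the roots of
    x^2 - x + (1/2 - s_max) = 0.
    """
    disc = 1 - 4 * (0.5 - s_max)
    if disc < 0:
        raise ValueError("no G+C content satisfies the constraint")
    root = math.sqrt(disc)
    return (1 - root) / 2, (1 + root) / 2


if __name__ == "__main__":
    freqs = base_frequencies("ACGTACGTAATT")  # toy sequence, not real data
    print(order_index(freqs), shannon_h(freqs))
    print(gc_bounds())
```

For a uniform composition (a = c = g = t = 1/4), S takes its minimum value of 1/4 and H its maximum of 2 bits; for a sequence made of a single base, S = 1 and H = 0, which is the negative relationship described in point (i).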