THE RESEARCH OF THE OCCURRENCE FREQUENCY DISTRIBUTION OF k-MER IN WHOLE DNA SEQUENCE

WANG Shu-lin, WANG Ji, CHEN Huo-wang, ZHANG Ding-xing
2006-01-01
ACTA BIOPHYSICA SINICA
Abstract: Studying the distribution of k-mers in a genome helps in understanding the relationship between genome structure and function, and it plays an important role in recognizing repetitive subsequences, partitioning sequences into introns and exons, and investigating genome evolution. After introducing the Hao method, which depicts k-mer frequencies as a fractal image, a novel method is proposed that generates a 3D frequency distribution map of the k-mers in a genome. The advantage of the 3D map is that differences in k-mer occurrence frequency are displayed clearly for biologists. A criterion for partitioning the occurrence frequencies into segments is then proposed on the basis of a 1D histogram derived from the 3D occurrence frequency distribution. The 1D histogram reveals local features of the k-mer occurrence frequency distribution; for example, the occurrence frequencies observed in the ultrahigh-frequency segment are discontinuous integers. Palindromes among the forbidden k-mers are studied briefly in the forbidden segment. The phenomenon of n-order zero intervals in the ultrahigh-frequency segment is investigated in depth. Moreover, it is proposed that the distribution of n-order zero intervals is a mark of the process of genome evolution, and many features of the logarithmic histogram of occurrence frequency are explained from a biological point of view. Extensive experiments indicate that the occurrence frequency distribution of k-mers follows a non-central F distribution. A combination of several non-central F distributions can fit a genome's k-mer occurrence frequency density that has the same number of peaks. The non-central F distribution is also compared experimentally with the Gamma distribution, which Hsieh and Luo proposed for fitting genome distributions. Because the two distributions complement each other in fitting the genome density distribution, a new distribution that combines the non-central F distribution with the Gamma distribution is presented, and experiments show that it fits the genome density distribution better than either distribution alone. Finally, two relationships are investigated in depth: the relationship between the maximal k-mer frequency in a genome and the k-mer length, and the relationship between the number of distinct k-mers that occur only once in a genome and the k-mer length. Both relationships are found to be consistent across many species, which is evidence for the neutral theory of genome evolution.
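The basic quantities named in the abstract can be illustrated with a minimal Python sketch: it counts overlapping k-mers, builds a 1D histogram of occurrence frequencies, and reads off the maximal frequency and the number of k-mers that occur only once. This is not the authors' implementation. The function names and the toy sequence are hypothetical, and treating "forbidden" k-mers as those with zero occurrences is an assumption made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(sequence, k):
    """Count every overlapping k-mer that consists only of A, C, G, T."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def frequency_histogram(counts, k):
    """Map each occurrence frequency to the number of distinct k-mers with it.

    Frequency 0 covers the k-mers that never occur in the sequence
    (assumed here to correspond to the "forbidden" k-mers).
    """
    histogram = Counter(counts.values())
    absent = len(ALPHABET) ** k - len(counts)
    if absent:
        histogram[0] = absent
    return dict(sorted(histogram.items()))


def absent_kmers(counts, k):
    """List all k-mers over the alphabet that have zero occurrences."""
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [kmer for kmer in all_kmers if kmer not in counts]


# Hypothetical toy sequence, not real genome data.
toy_sequence = "ACGTACGTTTGACA"
k = 2

counts = kmer_counts(toy_sequence, k)
histogram = frequency_histogram(counts, k)

# Quantities whose dependence on k the abstract discusses.
max_frequency = max(counts.values())
unique_kmers = sum(1 for c in counts.values() if c == 1)
missing = absent_kmers(counts, k)

print(histogram)
print(max_frequency, unique_kmers, len(missing))
```

Repeating the last steps over a range of k values would give the two curves the abstract compares across species (maximal frequency versus k, and number of once-only k-mers versus k). For a whole genome, a dictionary-based counter like this becomes memory-hungry as k grows, so a dedicated k-mer counting tool would be the more practical choice.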