Multivariate Entropy Distance Method for Distinguishing Coding and Non-coding DNA Sequences

Zhengqing Ouyang
2004-01-01
Abstract: The multivariate entropy distance (MED) method is a new, highly efficient and accurate gene identification algorithm that uses the so-called entropy-density profile (EDP) for the global description of a DNA sequence of finite length. It is found that the EDPs of coding and non-coding sequences show clearly distinct patterns. An individual sequence displays an EDP clearly clustered around its respective mean EDP (coding or non-coding). The rapid convergence of the partially averaged EDP makes the MED method practical for gene finding: as few as 20 samples are needed to achieve highly accurate identification of genes across a whole genome. Tests on a dozen prokaryotic genomes give an overall prediction accuracy of over 99%. The results suggest the value of multivariate, global descriptions for complex biological systems.
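The idea of classifying a sequence by the distance of its EDP to a mean coding or non-coding EDP can be illustrated with a minimal sketch. The abstract does not define the EDP or the distance measure, so everything specific here is an assumption made for illustration: the profile is taken as each symbol's normalised entropy contribution, s_i = -p_i log(p_i) / H with H = -sum_j p_j log(p_j), the alphabet is the four nucleotides, the distance is Euclidean, and the function names (edp, mean_profile, classify) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: the abstract does not state which symbols the EDP is built on


def edp(seq, alphabet=ALPHABET):
    """Assumed entropy-density profile: s_i = -p_i*log(p_i) / H."""
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    probs = [counts[a] / total for a in alphabet]
    # Per-symbol entropy terms; a symbol that never occurs contributes 0.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in probs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one distinct symbol present: the profile is undefined, return zeros.
        return [0.0] * len(alphabet)
    return [t / entropy for t in terms]


def mean_profile(profiles):
    """Component-wise mean of a list of profiles (a partially averaged EDP)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def distance(a, b):
    """Euclidean distance between two profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(seq, coding_mean, noncoding_mean):
    """Label a sequence by whichever mean EDP its own EDP is closer to."""
    profile = edp(seq)
    if distance(profile, coding_mean) <= distance(profile, noncoding_mean):
        return "coding"
    return "non-coding"
```

In use, the two mean profiles would be built with mean_profile from a small set of known coding and non-coding samples (the abstract reports that about 20 samples suffice), and each candidate sequence would then be passed to classify.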