Numerical sequence representation of DNA sequences and methods to distinguish coding and non-coding sequences in a complete genome

Zuguo Yu, V. Anh, Yu Zhou, Li-qian Zhou
2007-01-01
Abstract: In this presentation we introduce two methods to distinguish coding and non-coding sequences in a complete genome. A numerical sequence representation of DNA sequences is introduced first. There exists a one-to-one correspondence between a DNA sequence and its numerical sequence representation. In the first method, three exponents from a multifractal analysis are selected to construct the parameter space. In the second method, which is based on a Fourier transform approach, three parameters from the power spectrum of the numerical sequence representation are selected to construct the parameter space. Each DNA sequence may then be represented by a point in each of these three-dimensional spaces. We found that the points corresponding to coding and non-coding sequences in the complete genomes of prokaryotes fall into different regions in both parameter spaces. If the point for a DNA sequence lies in the region corresponding to coding sequences, the sequence is recognized as a coding sequence; otherwise, it is classified as a non-coding one. The average accuracies using Fisher's discriminant algorithm for coding and non-coding sequences are satisfactory.
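To make the second (Fourier-based) method more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the numerical values assigned to each nucleotide, nor does it name the three spectral parameters, so everything specific here is an assumption: the mapping A, C, G, T to -2, -1, 1, 2 is a hypothetical one-to-one encoding, and `period3_peak_ratio` is an illustrative spectral feature chosen for this example only. The function names `encode`, `decode`, `power_spectrum`, and `period3_peak_ratio` are likewise invented for illustration.

```python
import cmath

# ASSUMPTION: a hypothetical one-to-one mapping; the abstract does not
# state the values the authors actually use.
NUCLEOTIDE_VALUES = {"A": -2, "C": -1, "G": 1, "T": 2}
VALUE_TO_NUCLEOTIDE = {value: base for base, value in NUCLEOTIDE_VALUES.items()}


def encode(dna):
    """Map a DNA string to its numerical sequence representation."""
    return [NUCLEOTIDE_VALUES[base] for base in dna.upper()]


def decode(values):
    """Invert encode(); possible because the mapping is one-to-one."""
    return "".join(VALUE_TO_NUCLEOTIDE[value] for value in values)


def power_spectrum(values):
    """Return |X(k)|^2 / n for k = 0..n-1 using a naive O(n^2) DFT."""
    n = len(values)
    spectrum = []
    for k in range(n):
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(values)
        )
        spectrum.append(abs(coefficient) ** 2 / n)
    return spectrum


def period3_peak_ratio(values):
    """ASSUMPTION: an illustrative feature, not one named in the abstract.

    Ratio of the power at frequency 1/3 to the mean power over all
    non-zero frequencies. Requires a length that is a multiple of 3.
    """
    n = len(values)
    if n < 6 or n % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3 and at least 6")
    spectrum = power_spectrum(values)
    mean_power = sum(spectrum[1:]) / (n - 1)
    return spectrum[n // 3] / mean_power
```

Under these assumptions, a sequence would be reduced to a point by computing three such spectral features, and a linear classifier such as Fisher's discriminant would then separate the coding points from the non-coding ones. The naive DFT keeps the sketch dependency-free; for genome-scale sequences an FFT routine would be the practical choice.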
Mathematics