Prokaryote gene data classifier design based on SVM

Li Xiao-xia, Bo Sun, Han Xue-mei, Zhang Ji-hong
DOI: https://doi.org/10.1109/ICBBE.2009.5163250
2009-01-01
Abstract: Gene recognition is an important problem in bioinformatics, spanning classic experiments, theory, and algorithm research. The E. coli K12 whole-genome sequence and gene annotation files from GenBank were analyzed in preparation for gene prediction. First, the four gene distribution types were analyzed. Second, non-coding samples were generated from the intervals between discrete genes, and the training set was constructed from all gene samples and non-gene fragments. Third, the probability densities of the GC ratio and length features of the training samples were plotted using the Parzen window method. The average GC ratios of gene and non-coding samples are 0.51 and 0.45, respectively; the average lengths are 954 and 164 nucleotides, respectively. Finally, a Fisher linear classifier and a support vector machine (SVM) were used to classify gene and non-gene patterns. The results show that the least squares support vector machine has an error rate of 14.8%, which is 1.3% lower than that of the Fisher classifier. ©2009 IEEE.
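The feature and classifier steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian kernel, the bandwidth, the midpoint threshold, and the toy samples are all assumptions made for illustration, and the SVM stage is omitted. It computes a GC ratio, a Parzen window density estimate, and a two-feature Fisher linear discriminant over (GC ratio, length).

```python
import math


def gc_ratio(seq):
    """Fraction of G and C nucleotides in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def parzen_density(x, samples, h):
    """Parzen window density estimate at x (Gaussian kernel, width h; both assumed)."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def _mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_discriminant(class_a, class_b):
    """Return (w, threshold) with w = Sw^-1 (mean_a - mean_b).

    A point x is assigned to class_a when w . x > threshold. The threshold is
    the midpoint of the projected class means (an assumed, common choice).
    """
    ma, mb = _mean(class_a), _mean(class_b)
    sa, sb = _scatter(class_a, ma), _scatter(class_b, mb)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Closed-form inverse of the symmetric 2x2 within-class scatter matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    threshold = 0.5 * (w[0] * (ma[0] + mb[0]) + w[1] * (ma[1] + mb[1]))
    return w, threshold


def is_gene(x, w, threshold):
    """Classify a (gc_ratio, length) feature vector with the fitted discriminant."""
    return w[0] * x[0] + w[1] * x[1] > threshold


# Toy (gc_ratio, length) samples, invented for illustration only.
genes = [(0.52, 900), (0.50, 1100), (0.53, 800), (0.49, 1000)]
noncoding = [(0.44, 150), (0.46, 200), (0.45, 120), (0.47, 180)]
w, threshold = fisher_discriminant(genes, noncoding)
```

With the fitted `w` and `threshold`, `is_gene((gc_ratio(seq), len(seq)), w, threshold)` would label a new fragment. On real data the error rate would be estimated on held-out samples rather than on the training set.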