Predicting Thermophilic Nucleotide Sequences Based on Chaos Game Representation Features and Support Vector Machine

JinLong Lu, XueHai Hu, Xiaolei Liu, Feng Shi
DOI: https://doi.org/10.1109/icbbe.2011.5780070
2011-01-01
Abstract: Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in the design of stable proteins. Predicting whether a DNA sequence is thermophilic is a long-standing problem that has not yet been satisfactorily resolved. We downloaded 10586 thermophilic and 14261 mesophilic bacterial nucleotide sequences from the NCBI database and used CD-HIT to remove sequences at the 95% homologous similarity threshold, leaving 1638 thermophilic and 2996 mesophilic sequences. Chaos game representation (CGR) can expose patterns hidden in a DNA sequence, visually revealing previously unknown structure. In this paper, we convert every DNA sequence into a high-dimensional vector with the CGR algorithm and predict the thermostability of the sequence from these CGR features using a support vector machine (SVM). Three groups of experiments are carried out, using 16-dimensional, 64-dimensional and 256-dimensional vectors, respectively. Each group is evaluated by a resubstitution test and a 10-fold cross-validation test. In the resubstitution test, all three groups perform very well, with accuracy reaching 0.9989 and the Matthews correlation coefficient (MCC) reaching 0.9978. In the 10-fold cross-validation test, the 256-dimensional vector gives the best result: the average accuracy is 0.9088 and the average MCC is 0.8169. These results show the effectiveness of the proposed method.
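The abstract does not say how the 16-, 64- and 256-dimensional vectors are built. One plausible reading, sketched below as an assumption, is that the CGR unit square is divided into a 2^k by 2^k grid for k = 2, 3, 4 and the fraction of CGR points in each cell is used as a feature. The corner layout, the function names `cgr_features` and `mcc`, and the handling of ambiguous bases are illustrative choices, not details taken from the paper. The sketch relies on the fact that the grid cell of a CGR point at resolution 2^k depends only on the last k bases, so the midpoint iteration can be tracked exactly with k-bit integers.

```python
import math

# Corner bits (x, y) for each base. This layout is one common CGR convention
# and is an assumption here; the abstract does not specify it.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_features(seq, k):
    """Return a 4**k-dimensional vector of normalized CGR grid-cell counts.

    Moving halfway toward a corner prepends that corner's bit to the binary
    expansion of each coordinate, so the cell index at resolution 2**k is
    given by the top k bits, tracked here as integers.
    """
    n = 1 << k                      # grid is n x n, with n = 2**k
    counts = [0] * (n * n)
    col = row = run = 0
    for base in seq.upper():
        bits = CORNER_BITS.get(base)
        if bits is None:            # ambiguous base (e.g. N): restart the walk
            col = row = run = 0
            continue
        col = (bits[0] << (k - 1)) | (col >> 1)
        row = (bits[1] << (k - 1)) | (row >> 1)
        run += 1
        if run >= k:                # skip points still influenced by the start
            counts[row * n + col] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, len(cgr_features("ACGTTGCAACGT", k)))
```

Vectors produced this way could then be passed to any SVM implementation and scored with accuracy and MCC under resubstitution and 10-fold cross-validation; the kernel and parameters used in the paper are not given in the abstract.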