Scores of Generalized Base Properties for Quantitative Sequence-Activity Modelings for E. Coli Promoters Based on Support Vector Machine.

Guizhao Liang, Zhiliang Li
DOI: https://doi.org/10.1016/j.jmgm.2006.12.004
IF: 2.942
2007-01-01
Journal of Molecular Graphics and Modelling
Abstract: A novel base sequence representation technique, SGBP (scores of generalized base properties), was derived from principal component analysis of a matrix of 1209 property parameters, covering 0D, 1D, 2D and 3D information, for the five bases A, C, G, T and U. It was then used to represent the sequence structures of E. coli promoters. The variables used as inputs to partial least squares (PLS) and support vector machine (SVM) models were selected by genetic algorithm-partial least squares (GA-PLS). According to a D-optimal design, all samples were divided into a training set, used to develop quantitative sequence-activity models (QSAMs), and a test set, used to validate the predictive power of the resulting models. The PLS-based QSAM showed that the base properties at positions −42, −34, −31, −33, −41, −46 and −29 may have more influence on promoter strength, which points toward the design of strong promoters. The SVM parameters were determined by response surface methodology. The results indicated that the SVM-based QSAM had better fitting and predictive abilities than the PLS-based QSAM for both internal and external samples. These results show that SGBP is a useful structural representation method for QSAMs, owing to its rich structural information, easy manipulation, and high characterization competence. Moreover, the SGBP-GA-SVM route can be further applied to sequence design and activity prediction for DNA or RNA.
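The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The property matrix here is random placeholder data rather than the 1209 published parameters. The number of retained components (3), the helper names `pca_scores` and `encode`, and the example hexamer are all assumptions made for illustration. Variable selection by GA-PLS and the SVM fit are omitted.

```python
import numpy as np

BASES = "ACGTU"  # the five bases named in the abstract


def pca_scores(properties, n_components):
    """Return PCA scores for a (bases x properties) matrix."""
    # Autoscale each property column; constant columns are only centered.
    centered = properties - properties.mean(axis=0)
    std = properties.std(axis=0)
    std[std == 0] = 1.0
    scaled = centered / std
    # With scaled = U S V^T, the principal component scores are U * S.
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def encode(sequence, score_table):
    """Concatenate the per-base score vectors along the sequence."""
    return np.concatenate([score_table[base] for base in sequence.upper()])


# Placeholder data (assumption): 5 bases x 20 made-up properties.
rng = np.random.default_rng(0)
properties = rng.normal(size=(len(BASES), 20))

# Assumption: keep 3 components; the abstract does not state the number.
scores = pca_scores(properties, n_components=3)
score_table = {base: scores[i] for i, base in enumerate(BASES)}

# Each position contributes 3 descriptors, so a hexamer yields 18 variables.
x = encode("TTGACA", score_table)
```

Under these assumptions, a promoter of length n becomes a vector of 3n descriptors, each tied to one sequence position. That position-wise layout is what allows a downstream model to attribute influence to individual positions.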