Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data

Rajaram Gana, Sona Vasudevan
DOI: https://doi.org/10.1186/s12860-019-0200-9
2019-06-28
BMC Molecular and Cell Biology
Abstract:<h3 class="Heading">Background</h3><p class="Para">To date, no one has claimed to have found a consensus sequon for <em class="EmphasisTypeItalic">O</em>-glycosylation. Thus, predicting the likelihood of <em class="EmphasisTypeItalic">O</em>-glycosylation with sequence and structural information using classical regression analysis is quite difficult. In particular, if a binary response is used to distinguish between <em class="EmphasisTypeItalic">O</em>-glycosylated and non-<em class="EmphasisTypeItalic">O</em>-glycosylated sequences, an appropriate set of non-<em class="EmphasisTypeItalic">O</em>-glycosylatable sequences is hard to find.</p><h3 class="Heading">Results</h3><p class="Para">Sequences from three similar post-translational modifications (PTMs) of proteins occurring at, or very near, the <em class="EmphasisTypeItalic">S/T</em>-site are analyzed: <em class="EmphasisTypeItalic">N</em>-glycosylation, <em class="EmphasisTypeItalic">O</em>-mucin type (<em class="EmphasisTypeItalic">O</em>-GalNAc) glycosylation, and phosphorylation. The findings include: 1) The consensus <em class="EmphasisTypeItalic">composite</em> sequon for <em class="EmphasisTypeItalic">O</em>-glycosylation is <em class="EmphasisTypeItalic">~(W–S/T–W)</em>, where "~" denotes the "not" operator. 2) The consensus sequon for phosphorylation is <em class="EmphasisTypeItalic">~(W–S/T/Y/H–W)</em>, although <em class="EmphasisTypeItalic">W–S/T/Y/H–W</em> is not an absolute inhibitor of phosphorylation. 3) For linear probability model (LPM) estimation, <em class="EmphasisTypeItalic">N</em>-glycosylated sequences are good approximations to non-<em class="EmphasisTypeItalic">O</em>-glycosylatable sequences, although <em class="EmphasisTypeItalic">N – ~P – S/T</em> is not an absolute inhibitor of <em class="EmphasisTypeItalic">O</em>-glycosylation. 4) The selective positioning of an amino acid along the sequence differentiates the PTMs of proteins.
5) Some <em class="EmphasisTypeItalic">N</em>-glycosylated sequences are also phosphorylated at the <em class="EmphasisTypeItalic">S/T</em>-site in the <em class="EmphasisTypeItalic">N – ~P – S/T</em> sequon. 6) ASA values for <em class="EmphasisTypeItalic">N</em>-glycosylated sequences are stochastically larger than those for <em class="EmphasisTypeItalic">O</em>-GlcNAc glycosylated sequences. 7) Structural attributes (beta turn II, II´, helix, beta bridges, beta hairpin, and the phi angle) are significant LPM predictors of <em class="EmphasisTypeItalic">O</em>-GlcNAc glycosylation. The LPM with sequence <em class="EmphasisTypeItalic">and</em> structural data as explanatory variables yields a Kolmogorov-Smirnov (KS) statistic of 99%. 8) With only sequence data, the KS statistic erodes to 80%, and 21% of out-of-sample <em class="EmphasisTypeItalic">O</em>-GlcNAc glycosylated sequences are mispredicted as not being glycosylated. The 95% confidence interval around this misprediction rate is 16% to 26%.</p><h3 class="Heading">Conclusions</h3><p class="Para">The data indicate the existence of a consensus sequon for <em class="EmphasisTypeItalic">O</em>-glycosylation and underscore the germaneness of structural information for predicting the likelihood of <em class="EmphasisTypeItalic">O</em>-glycosylation.</p>
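The abstract names two ingredients without spelling them out: a linear probability model estimated by ridge regression, and a KS statistic used to judge how well the fitted scores separate the two classes. The following is a minimal sketch of how these could fit together, not the authors' implementation. It assumes the usual closed-form ridge estimator, (X'X + kI)b = X'y, and the usual two-sample KS definition (the largest gap between the empirical CDFs of the scores in each class). The function names, the ridge constant `k`, and the synthetic one-feature data are all hypothetical and chosen only for illustration.

```python
import random


def ridge_lpm_fit(X, y, k):
    """Ridge estimate of linear probability model coefficients.

    Solves (X'X + k*I) b = X'y by Gauss-Jordan elimination. For simplicity
    every column, including any intercept column, is penalized; that is an
    assumption of this sketch. With k = 0 and a singular X'X the division
    below fails, so use k > 0 in that case.
    """
    p = len(X[0])
    # Augmented system [X'X + kI | X'y], one row per coefficient.
    A = [
        [sum(row[i] * row[j] for row in X) + (k if i == j else 0.0)
         for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(X, beta):
    """Fitted LPM scores; these are not constrained to lie in [0, 1]."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def ks_statistic(scores_pos, scores_neg):
    """Largest gap between the empirical CDFs of the two score samples."""
    def ecdf(sample, t):
        return sum(s <= t for s in sample) / len(sample)

    thresholds = sorted(set(scores_pos) | set(scores_neg))
    return max(abs(ecdf(scores_pos, t) - ecdf(scores_neg, t))
               for t in thresholds)


# Synthetic data (NOT the paper's data): an intercept column plus one
# hypothetical feature whose mean differs between the two classes.
random.seed(0)
X, y = [], []
for _ in range(200):
    label = random.random() < 0.5
    x1 = random.gauss(1.0 if label else -1.0, 1.0)
    X.append([1.0, x1])
    y.append(1.0 if label else 0.0)

beta = ridge_lpm_fit(X, y, k=1.0)
scores = predict(X, beta)
pos = [s for s, yi in zip(scores, y) if yi == 1.0]
neg = [s for s, yi in zip(scores, y) if yi == 0.0]
ks = ks_statistic(pos, neg)
print(f"KS statistic: {ks:.2f}")
```

A KS statistic near 1 means the score distributions of the two classes barely overlap. A value near 0 means the scores do not separate the classes at all.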
cell biology