Combining feature engineering and feature selection to improve the prediction of methionine oxidation sites in proteins
Francisco J. Veredas,Daniel Urda,José L. Subirats,Francisco R. Cantón,Juan C. Aledo
DOI: https://doi.org/10.1007/s00521-018-3655-2
2018-08-03
Neural Computing and Applications
Abstract:Methionine is a proteinogenic amino acid that can be post-translationally modified. It is now well established that reactive oxygen species can oxidise methionine residues within living cells. For a long time, it has been thought that such a modification represents merely an inevitable damage derived from aerobic metabolism. However, several authors have begun to contemplate a possible role for this methionine modification in cell signalling. During the last years, a number of proteomic studies have been carried out with the purpose of detecting proteins containing oxidised methionines. Although these proteomic works allow to pinpoint those methionines being oxidised, they are also arduous, expensive and time-consuming. For these reasons, computational approaches aimed at predicting methionine oxidation sites in proteins become an appealing alternative. In the current work, we address methionine oxidation prediction by combining computational intelligence methods with feature engineering and feature selection techniques to improve the efficacy of several machine learning models, while reducing the number of input characteristics needed to get high accuracy rates. We compare random forests, support vector machines, neural networks and flexible discriminant analysis models. Random forests give the best AUC (0.8124±0.0334\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$0.8124 \pm 0.0334$$\end{document}) and accuracy rates (0.7590±0.0551\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$0.7590 \pm 0.0551$$\end{document}) by using only a reduced set of 16 characteristics. These results surpass the outcomes of previous works. In addition, we present an end-user script that has been developed to take a protein ID as an input and return a list with the oxidation state of all the methionine residues found in the analysed protein. Finally, to illustrate the applicability of this tool, we have selected the human α1\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$\alpha 1$$\end{document}-antitrypsin protein as a case study. This protein was selected because it was not present among the set of proteins used to build up the predictive models but the protein has been well characterised experimentally in terms of methionine oxidation. The prediction returned by our script fully matches the empirical evidence. Out of the nine methionine residues found in this protein, our model predicts the oxidation of only two of them, M351 and M358, which have been reported, on the base of mass spectrometry analyses, to be particularly susceptible to oxidation.
computer science, artificial intelligence