Robust Predictions of Specialized Metabolism Genes Through Machine Learning
Bethany M. Moore, Peipei Wang, Pengxiang Fan, Bryan Leong, Craig A. Schenck, John P. Lloyd, Melissa D. Lehti-Shiu, Robert L. Last, Eran Pichersky, Shin-Han Shiu
DOI: https://doi.org/10.1073/pnas.1817074116
IF: 11.1
2019-01-01
Proceedings of the National Academy of Sciences
Abstract: Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features differed significantly between SM and GM genes, no single feature was effective at distinguishing SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted as SM genes. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
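The true positive rate and true negative rate quoted in the abstract are standard binary-classification measures. The following is a minimal sketch of how they are computed, assuming SM genes are labeled 1 and GM genes are labeled 0. The function name `tpr_tnr` and the toy labels are hypothetical illustrations and are not taken from the paper or its data.

```python
def tpr_tnr(y_true, y_pred):
    """Return (true positive rate, true negative rate).

    Assumption: label 1 = SM gene (positive class), label 0 = GM gene.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SM predicted as SM
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SM predicted as GM
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GM predicted as GM
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GM predicted as SM
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy labels for illustration only (not data from the study).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0]
toy_tpr, toy_tnr = tpr_tnr(toy_true, toy_pred)
```

Under this reading, a true positive rate of 87% means that 87% of benchmark SM genes were predicted as SM, and a true negative rate of 71% means that 71% of benchmark GM genes were predicted as GM.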