Predictive Models of Genetic Redundancy in Arabidopsis thaliana

Siobhan A Cusack, Peipei Wang, Serena G Lotreck, Bethany M Moore, Fanrui Meng, Jeffrey K Conner, Patrick J Krysan, Melissa D Lehti-Shiu, Shin-Han Shiu
DOI: https://doi.org/10.1093/molbev/msab111
IF: 10.7
2021-04-19
Molecular Biology and Evolution
Abstract: Genetic redundancy refers to a situation where an individual with a loss-of-function mutation in one gene (single mutant) does not show an apparent phenotype until one or more paralogs are also knocked out (double/higher-order mutant). Previous studies have identified some characteristics common among redundant gene pairs, but a predictive model of genetic redundancy incorporating a wide variety of features derived from accumulating omics and mutant phenotype data is yet to be established. In addition, the relative importance of these features for genetic redundancy remains largely unclear. Here, we establish machine learning models for predicting whether a gene pair is likely redundant or not in the model plant Arabidopsis thaliana based on six feature categories: functional annotations, evolutionary conservation including duplication patterns and mechanisms, epigenetic marks, protein properties including posttranslational modifications, gene expression, and gene network properties. The definition of redundancy, data transformations, feature subsets, and machine learning algorithms used significantly affected model performance based on holdout, testing phenotype data. Among the most important features in predicting gene pairs as redundant were having a paralog(s) from recent duplication events, annotation as a transcription factor, downregulation during stress conditions, and having similar expression patterns under stress conditions. We also explored the potential reasons underlying mispredictions and limitations of our studies. This genetic redundancy model sheds light on characteristics that may contribute to long-term maintenance of paralogs, and will ultimately allow for more targeted generation of functionally informative double mutants, advancing functional genomic studies.
genetics & heredity, biochemistry & molecular biology, evolutionary biology
What problem does this paper attempt to address?
This paper addresses the problem of predicting genetic redundancy. Specifically, the authors build machine-learning models to predict whether a gene pair in Arabidopsis thaliana is likely redundant. Genetic redundancy refers to a situation where an individual with a loss-of-function mutation in a single gene (a single mutant) shows no apparent phenotype, and a phenotype appears only once one or more paralogs are also knocked out (a double or higher-order mutant). Previous studies have identified some characteristics common among redundant gene pairs, but a predictive model that combines a wide variety of features derived from accumulating omics and mutant phenotype data had not been established, and the relative importance of these features remained largely unclear. To fill this gap, the authors established machine-learning models based on six feature categories: functional annotations, evolutionary conservation (including duplication patterns and mechanisms), epigenetic marks, protein properties (including posttranslational modifications), gene expression, and gene network properties. They evaluated model performance on holdout testing phenotype data and found that it was significantly affected by the definition of redundancy, the data transformations, the feature subsets, and the machine-learning algorithms used. Among the most important features for predicting a gene pair as redundant were having a paralog(s) from recent duplication events, annotation as a transcription factor, downregulation during stress conditions, and similar expression patterns under stress conditions. The authors also explored potential reasons for mispredictions and the limitations of the study.
In conclusion, the machine-learning models built in this paper shed light on characteristics that may contribute to the long-term maintenance of paralogs, and should ultimately allow more targeted generation of functionally informative double mutants, advancing functional genomics research.
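To make the idea of pair-level features concrete, the following is a minimal sketch of how per-gene records might be combined into one feature vector for a gene pair. It is not the authors' pipeline: the field names (`is_tf`, `stress_log2fc`), the three derived features, and all values are assumptions chosen only to mirror features the abstract names (transcription factor annotation, downregulation under stress, similar stress expression patterns).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A flat profile has no defined correlation; treat it as uninformative.
        return 0.0
    return cov / sqrt(vx * vy)


def pair_features(gene_a, gene_b):
    """Combine two per-gene records into one feature dict for the pair.

    The feature names here are hypothetical, not taken from the paper.
    """
    a_fc = gene_a["stress_log2fc"]
    b_fc = gene_b["stress_log2fc"]
    return {
        # Similarity of expression patterns across stress conditions
        "stress_expr_corr": pearson(a_fc, b_fc),
        # 1 if both genes are annotated as transcription factors
        "both_tf": int(gene_a["is_tf"] and gene_b["is_tf"]),
        # 1 if each gene is downregulated in at least one stress condition
        "both_down_under_stress": int(min(a_fc) < 0 and min(b_fc) < 0),
    }


# Made-up values for two hypothetical paralogs (not real measurements)
gene_a = {"is_tf": True, "stress_log2fc": [-1.0, -2.0, 0.5, -0.5]}
gene_b = {"is_tf": True, "stress_log2fc": [-0.8, -1.9, 0.4, -0.6]}
features = pair_features(gene_a, gene_b)
```

Feature vectors of this kind, one per gene pair, could then be passed to any standard classifier together with redundant/nonredundant labels; which algorithm and which redundancy definition to use are exactly the choices the paper reports as affecting performance.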