Abstract: Accurate prediction of complex traits is an important task in quantitative genetics that has become increasingly relevant for personalized medicine. Genotypes have traditionally been used for trait prediction with a variety of methods, such as mixed models, Bayesian methods, penalized regressions, dimension reduction, and machine learning. Recent studies have shown that gene expression levels can yield higher prediction accuracy than genotypes. However, only a few prediction methods were used in these studies. Thus, a comprehensive assessment of methods is needed to fully evaluate the potential of gene expression as a predictor of complex trait phenotypes. Here, we used data from the Drosophila Genetic Reference Panel (DGRP) to compare the ability of several existing statistical learning methods to predict starvation resistance from gene expression in the two sexes separately. The methods considered differ in their assumptions about the distribution of gene effect sizes, ranging from models that assume every gene affects the trait to sparser models, and in their ability to capture gene-gene interactions. We also used functional annotation (Gene Ontology (GO)) as an external source of biological information to inform prediction models. The results show that prediction accuracy differs between methods, although the differences are generally not large. Methods performing variable selection gave higher accuracy in females, while methods assuming a more polygenic architecture performed better in males. Incorporating GO annotations further improved prediction accuracy for a few GO terms of biological significance. Biological significance extended to the genes underlying highly predictive GO terms, with different genes emerging between sexes. Notably, the Insulin-like Receptor was prevalent across methods and sexes. Our results confirm the potential of transcriptomic prediction and highlight the importance of selecting appropriate methods and strategies to achieve accurate predictions.
