Interpretable and predictive models based on high-dimensional data in ecology and evolution

Joshua P Jahner, C. Alex Buerkle, Dustin G Gannon, Eliza M Grames, S. Eryn McFarlane, Andrew Siefert, Katherine L Bell, Victoria L DeLeo, Matthew L Forister, Joshua G Harrison, Daniel C Laughlin, Amy C Patterson, Breanna F Powers, Chhaya M Werner, Isabella A Oleksy
DOI: https://doi.org/10.1101/2024.03.15.585297
2024-10-08
Abstract: The proliferation of high-dimensional data in ecology and evolutionary biology raises the promise of statistical and machine learning models that are highly predictive and interpretable. However, high-dimensional data are commonly burdened with an inherent trade-off: in-sample prediction of outcomes will improve as additional predictors are included in the model, but this may come at the cost of poor predictive accuracy and limited generalizability for future or unsampled observations (out-of-sample prediction). To confront this problem of overfitting, sparse models can focus on key predictors by correctly placing low weight on unimportant variables. We competed nine methods to quantify their performance in variable selection and prediction using simulated data with different sample sizes, numbers of predictors, and strengths of effects. Overfitting was typical for many methods and simulation scenarios. Despite this, in-sample and out-of-sample prediction converged on the true predictive target for simulations with more observations, larger causal effects, and fewer predictors. Accurate variable selection to support process-based understanding will be unattainable for many realistic sampling schemes in ecology and evolution. We use our analyses to characterize data attributes for which statistical learning is possible, and illustrate how some sparse methods can achieve predictive accuracy while mitigating and learning the extent of overfitting.
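The trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation design and does not reproduce any of their nine methods: the sample size, number of predictors, effect size, penalty `lam`, and all function names below are assumptions chosen for illustration. It compares an unpenalized least-squares fit with a simple lasso (one example of a sparse method, fitted here by coordinate descent) on simulated data in which only a few predictors have a causal effect.

```python
import numpy as np

def simulate(n, p, n_causal, effect, rng):
    """Simulate n observations of p predictors; only the first n_causal affect y."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:n_causal] = effect
    y = X @ beta + rng.standard_normal(n)  # unit-variance noise
    return X, y

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the predictions y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_ols(X, y):
    """Ordinary least squares without an intercept (simulated data are zero-mean)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1, no intercept."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]          # remove predictor j's contribution
            rho = X[:, j] @ resid / n        # its correlation with the partial residual
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft-threshold
            resid -= X[:, j] * b[j]          # restore the updated contribution
    return b

rng = np.random.default_rng(1)
# Assumed scenario: few observations relative to predictors, three causal predictors.
X_train, y_train = simulate(n=50, p=40, n_causal=3, effect=1.0, rng=rng)
X_test, y_test = simulate(n=1000, p=40, n_causal=3, effect=1.0, rng=rng)

b_ols = fit_ols(X_train, y_train)
b_lasso = fit_lasso(X_train, y_train, lam=0.2)

r2_ols_in = r_squared(y_train, X_train @ b_ols)
r2_ols_out = r_squared(y_test, X_test @ b_ols)
r2_lasso_in = r_squared(y_train, X_train @ b_lasso)
r2_lasso_out = r_squared(y_test, X_test @ b_lasso)
n_selected = int(np.count_nonzero(b_lasso))

print(f"OLS   in-sample R2 = {r2_ols_in:.2f}, out-of-sample R2 = {r2_ols_out:.2f}")
print(f"Lasso in-sample R2 = {r2_lasso_in:.2f}, out-of-sample R2 = {r2_lasso_out:.2f}")
print(f"Lasso kept {n_selected} of {X_train.shape[1]} predictors")
```

In a scenario like this one, the gap between in-sample and out-of-sample R2 gives a rough measure of the extent of overfitting. Rerunning with more observations, larger effects, or fewer predictors is one way to explore the conditions the abstract names, though the fixed penalty here would normally be tuned, for example by cross-validation.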
Genomics