Abstract: The growing use of model-selection principles in ecology for statistical inference is underpinned by information criteria (IC) and cross-validation (CV) techniques. Although IC techniques, such as Akaike's Information Criterion, have historically been more popular in ecology, CV is a versatile and increasingly used alternative. CV uses data splitting to estimate model scores based on out-of-sample predictive performance, so it can be used even when it is not possible to derive a likelihood (e.g., machine learning) or to count parameters precisely (e.g., mixed-effects models and penalised regression). Here we provide a primer to understanding and applying CV in ecology. We review commonly applied variants of CV, including approximate methods, and make recommendations for their use based on the statistical context. We explain some important but often overlooked technical aspects of CV, such as bias correction, estimation uncertainty, score selection, and parsimonious selection rules. We also address misconceptions (and truths) about impediments to the use of CV, including computational cost and ease of implementation, and clarify the relationship between CV and information-theoretic approaches to model selection. The paper includes two ecological case studies that illustrate the application of the techniques. We conclude that CV-based model selection should be widely applied to ecological analyses because of its robust estimation properties and the broad range of situations in which it is applicable. In particular, we recommend using leave-one-out (LOO) or approximate LOO CV to minimise bias, or otherwise K-fold CV with bias correction if K<10. To mitigate overfitting, we recommend calibrated selection via the modified one-standard-error rule, which accounts for the predominant cause of overfitting: score-estimation uncertainty.
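As a rough illustration of the workflow the abstract describes, the sketch below scores polynomial regression models by K-fold CV and then applies a parsimonious selection rule. It is only a minimal sketch under stated assumptions: the simulated data, the helper names `kfold_mse` and `one_se_select`, and the choice of mean squared error as the score are all hypothetical, and the rule shown is the ordinary one-standard-error rule, not the modified rule the paper recommends. No bias correction is applied.

```python
import numpy as np


def kfold_mse(x, y, degree, k=10, seed=0):
    """Return the k per-fold mean squared errors of a polynomial fit.

    The same seed gives the same split for every model, so candidate
    models are compared on identical folds.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coefs, x[test_idx])
        scores.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(scores)


def one_se_select(fold_scores):
    """Ordinary one-standard-error rule (an assumption, not the paper's
    modified rule): pick the simplest model whose mean loss is within one
    standard error of the best model's mean loss.

    fold_scores maps a complexity value (here, polynomial degree) to the
    array of per-fold losses for that model.
    """
    means = {m: s.mean() for m, s in fold_scores.items()}
    best = min(means, key=means.get)
    best_scores = fold_scores[best]
    se_best = best_scores.std(ddof=1) / np.sqrt(len(best_scores))
    threshold = means[best] + se_best
    selected = min(m for m in means if means[m] <= threshold)
    return best, selected


# Hypothetical data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=60)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=60)

# Score candidate models of increasing complexity with 10-fold CV.
fold_scores = {d: kfold_mse(x, y, degree=d, k=10) for d in range(1, 5)}
best, selected = one_se_select(fold_scores)
print("lowest-loss degree:", best, "| one-SE choice:", selected)
```

By construction the one-standard-error choice is never more complex than the lowest-loss model, which is how rules of this kind guard against overfitting caused by uncertainty in the estimated scores.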

Approximate Cross-validation: Guarantees for Model Assessment and Selection

Iterative Approximate Cross-Validation

Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models

Cross-validation: what does it estimate and how well does it do it?

On The Smoothness of Cross-Validation-Based Estimators Of Classifier Performance

Is Cross-Validation the Gold Standard to Evaluate Model Performance?

Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery

On the Asymptotic Optimality of Cross-Validation based Hyper-parameter Estimators for Regularized Least Squares Regression Problems

Bootstrapping the Cross-Validation Estimate

Fast and Informative Model Selection using Learning Curve Cross-Validation

Is K-fold cross validation the best model selection method for Machine Learning?

Approximate leave-future-out cross-validation for Bayesian time series models

Cross validation for model selection: a primer with examples from ecology

A survey of cross-validation procedures for model selection

Optimizing for Generalization in Machine Learning with Cross-Validation Gradients

Cross-validation on extreme regions

Overview of model validation for survival regression model with competing risks using melanoma study data

On Neighbourhood Cross Validation

Bootstrapping the Out-of-sample Predictions for Efficient and Accurate Cross-Validation

Confidence intervals for the Cox model test error from cross-validation