Abstract: Specifying, assessing, and selecting amongst candidate statistical models is fundamental to ecological research. Commonly used approaches to model selection are based on predictive scores and include information criteria, such as Akaike's Information Criterion, and cross validation. Based on data splitting, cross validation is particularly versatile because it can be used even when it is not possible to derive a likelihood (e.g., many forms of machine learning) or count parameters precisely (e.g., mixed-effects models). However, much of the literature on cross validation is technical and spread across statistical journals, making it difficult for ecological analysts to assess and choose amongst the wide range of options. Here we provide a comprehensive, accessible review that explains important (but often overlooked) technical aspects of cross validation for model selection, such as bias correction, estimation uncertainty, choice of scores, and selection rules to mitigate overfitting. We synthesise the relevant statistical advances to make recommendations for the choice of cross-validation technique, and we present two ecological case studies to illustrate their application. In most instances, we recommend using exact or approximate leave-one-out cross validation to minimise bias, or otherwise k-fold with bias correction if k < 10. To mitigate overfitting when using cross validation, we recommend calibrated selection via our recently introduced modified one-standard-error rule. We advocate for the use of predictive scores in model selection across a range of typical modelling goals, such as exploration, hypothesis testing, and prediction, provided that models are specified in accordance with the stated goal. We also emphasise, as others have done, that inference on parameter estimates is biased if preceded by model selection and instead requires a carefully specified single model or further technical adjustments.
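As a rough illustration of the ideas named in the abstract, the following Python sketch scores a set of nested polynomial models by k-fold cross validation and then applies the standard one-standard-error rule. It is not the modified one-standard-error rule the authors recommend, and it applies no bias correction. The simulated data, the polynomial candidates, and the function names `kfold_scores` and `one_se_select` are all assumptions made for this example.

```python
import numpy as np


def kfold_scores(x, y, degree, k=10, seed=0):
    """Per-fold mean squared prediction error for a polynomial of the given degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # same seed => same folds for every model
    folds = np.array_split(idx, k)
    scores = []
    for test in folds:
        train = np.setdiff1d(idx, test)    # all observations not in the test fold
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[test])
        scores.append(np.mean((y[test] - pred) ** 2))
    return np.array(scores)


def one_se_select(fold_scores_by_model):
    """Standard one-standard-error rule (lower score is better).

    Models must be ordered from simplest to most complex. Returns the index
    of the simplest model whose mean score lies within one standard error
    of the best mean score.
    """
    means = np.array([s.mean() for s in fold_scores_by_model])
    best = int(np.argmin(means))
    best_scores = fold_scores_by_model[best]
    se_best = best_scores.std(ddof=1) / np.sqrt(len(best_scores))
    threshold = means[best] + se_best
    return int(np.argmax(means <= threshold))  # first index meeting the threshold


# Hypothetical simulated data: a quadratic signal with Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=100)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.5, size=100)

degrees = [1, 2, 3, 4, 5]                  # candidate models, simplest first
fold_scores = [kfold_scores(x, y, d, k=10) for d in degrees]
selected_degree = degrees[one_se_select(fold_scores)]
print("selected polynomial degree:", selected_degree)
```

Using the same fold assignment for every candidate keeps the score comparison paired, which is one reasonable design choice; the standard error here is computed naively across folds and should be read as a rough guide only.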


Cross validation for model selection: a review with examples from ecology
