Abstract: Machine learning (ML) is revolutionizing the study of ecology and evolution, but the performance of models (and their evaluation) depends on the quality of the training and validation data. We currently have standard metrics for evaluating model performance (e.g., precision, recall, F1), but these partly overlook the ultimate aim of addressing the specific research question to which the model will be applied. Because improving performance metrics has diminishing returns, particularly when data are inherently noisy, biologists often face the conundrum of investing more time in maximising performance metrics at the expense of doing the actual research. This leads to the question: how much noise can we accept in our ML models? Here, we start by describing an under-reported source of noise that can cause performance metrics to underestimate true model performance. Specifically, ambiguity between categories or mistakes in the labelling of the validation data produces hard ceilings that limit performance metric scores. This source of error is common in biological systems, which means that many models could be performing better than the metrics suggest. Next, we argue and show that imperfect models (e.g., those with low F1 scores) can still be usable. First, we propose a simulation framework to evaluate the robustness of a model for hypothesis testing. Second, we show how to determine the utility of a model by supplementing existing performance metrics with 'biological validations', which involve applying ML models to unlabelled data in different ecological contexts for which we can anticipate the outcome. Together, our simulations and case study show that effect sizes and expected biological patterns can be detected even when performance metrics are relatively low (e.g., F1 between 60% and 70%). In doing so, we provide a roadmap for approaches to validating ML models that are tailored to research in ecology and evolutionary biology.
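The ceiling effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' framework; it assumes a hypothetical binary task with balanced classes, a model that always predicts the true category, and validation labels that are flipped independently at a fixed rate. All function names and parameters are illustrative.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def measured_f1_of_perfect_model(n=20000, label_error_rate=0.1, seed=0):
    """F1 reported for a perfect model when validation labels are noisy.

    Assumptions (illustrative only): two balanced classes and symmetric,
    independent label flips in the validation set.
    """
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n)]
    predictions = list(truth)  # hypothetical model that is always right
    # Validation labels: each one is flipped with probability label_error_rate.
    noisy_labels = [1 - t if rng.random() < label_error_rate else t
                    for t in truth]
    return f1_score(noisy_labels, predictions)


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.2, 0.3):
        score = measured_f1_of_perfect_model(label_error_rate=rate)
        print(f"label error rate {rate:.1f}: measured F1 = {score:.3f}")
```

Under these assumptions the measured F1 of a flawless model sits close to one minus the label error rate, so the score reflects the quality of the validation labels rather than the quality of the model. Real label noise is rarely symmetric or independent, so the exact ceiling in practice will differ.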

Perturbation Validation: A New Heuristic to Validate Machine Learning Models

Model Validation Via Uncertainty Propagation and Data Transformations

Assessing Robustness of Machine Learning Models using Covariate Perturbations

Towards Unsupervised Validation of Anomaly-Detection Models

Quantitative model validation techniques: new insights

A New Validation Metric for Models with Correlated Responses Using Limited Experimental and Simulation Data

Moving towards more holistic validation of machine learning-based approaches in ecology and evolution

Proximal Validation Protocol

Validating Unsupervised Machine Learning Techniques for Software Defect Prediction With Generic Metamorphic Testing

Improving Model Robustness by Adaptively Correcting Perturbation Levels with Active Queries

Train on Validation: Squeezing the Data Lemon

Robust Validation: Confident Predictions Even When Distributions Shift

Learning perturbation sets for robust machine learning

Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection

A general model validation and testing tool

Stability Evaluation via Distributional Perturbation Analysis

Perturbation Sensitivity Analysis to Detect Unintended Model Biases

Stability Evaluation Through Distributional Perturbation Analysis

On (in)validating environmental models. 1. Principles for formulating a Turing‐like Test for determining when a model is fit‐for purpose

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation