Abstract: The use of statistical models to study the impact of weather on crop yield continues to grow. Unfortunately, such applications rely on datasets with very few samples (typically one sample per year). In general, statistical inference uses three datasets: a training dataset to optimize the model parameters, a validation dataset to select the best model, and a testing dataset to evaluate the model's ability to generalize. Splitting the overall database into three datasets is often impossible in crop yield modelling because of the limited number of samples. Leave-one-out cross-validation, or simply leave one out (LOO), is often used to assess model performance or to select among competing models when the sample size is small. However, the model choice is then typically made using only the testing dataset, which can be misleading because it favours unnecessarily complex models. The nested cross-validation approach was introduced in machine learning to avoid this problem by truly using three datasets even when the database is small. In this study, we propose one particular implementation of nested cross-validation, called nested leave-two-out cross-validation or simply leave two out (LTO), to select the best model (using the validation dataset) and to estimate the true model quality (using the testing dataset). Two applications are considered: robusta coffee in Cu M'gar (Dak Lak, Vietnam) and grain maize over 96 French departments. In both cases, LOO is misleading because it chooses models that are too complex; LTO indicates that simpler models actually perform better when a reliable generalization test is considered. The simple models obtained using the LTO approach have improved yield anomaly forecasting skill for both crops. The LTO approach can also be used in seasonal forecasting applications. We suggest that the LTO method should become a standard procedure for statistical crop modelling.
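The contrast between LOO and nested LTO described above can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' implementation. The candidate models (polynomials of increasing degree fitted with `np.polyfit`), the synthetic data, and the function names `loo_select` and `nested_lto` are all assumptions introduced here. In particular, refitting the selected model on all non-test samples before scoring the test sample is the usual nested cross-validation convention and may differ from the paper's exact procedure.

```python
import numpy as np

def fit_predict(x_train, y_train, x_new, degree):
    """Fit a polynomial of the given degree and predict at x_new."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coeffs, x_new)

def loo_mse(x, y, idx, degree):
    """Leave-one-out mean squared error of one candidate over the samples in idx."""
    errs = []
    for j in idx:
        train = idx[idx != j]
        pred = fit_predict(x[train], y[train], x[j], degree)
        errs.append((pred - y[j]) ** 2)
    return float(np.mean(errs))

def loo_select(x, y, degrees):
    """Plain LOO: the same held-out errors both select and score the model."""
    idx = np.arange(len(x))
    scores = [loo_mse(x, y, idx, d) for d in degrees]
    best = int(np.argmin(scores))
    return degrees[best], float(np.sqrt(scores[best]))

def nested_lto(x, y, degrees):
    """Nested leave-two-out (assumed form): for every test sample i, each
    remaining sample serves in turn as the validation sample, so each inner
    fit uses n - 2 samples. The winning degree is scored on sample i only."""
    n = len(x)
    idx = np.arange(n)
    test_sq_err = np.empty(n)
    chosen = []
    for i in range(n):
        outer = idx[idx != i]                      # sample i is never seen during selection
        val_scores = [loo_mse(x, y, outer, d) for d in degrees]
        best = degrees[int(np.argmin(val_scores))]
        chosen.append(best)
        # Assumption: refit the selected model on all n - 1 non-test samples.
        pred_i = fit_predict(x[outer], y[outer], x[i], best)
        test_sq_err[i] = (pred_i - y[i]) ** 2
    return float(np.sqrt(test_sq_err.mean())), chosen

# Synthetic stand-in for a short yield record (20 "years"), purely illustrative.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 0.8 * x + rng.normal(scale=0.3, size=x.size)
degrees = [0, 1, 2, 3]

loo_degree, loo_rmse = loo_select(x, y, degrees)
lto_rmse, lto_degrees = nested_lto(x, y, degrees)
print("LOO selected degree:", loo_degree, "reported RMSE:", round(loo_rmse, 3))
print("LTO test RMSE:", round(lto_rmse, 3))
```

In this sketch the LOO score is optimistic by construction, because the degree is chosen to minimise the very error that is then reported. The LTO test error comes from samples that played no part in the selection, at the cost of roughly n times more model fits.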

Approximate leave-future-out cross-validation for Bayesian time series models

Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models

Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models

Iterative Approximate Cross-Validation

Bayesian leave-one-out cross-validation for large data

Approximate Cross-validation: Guarantees for Model Assessment and Selection

Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers (extended version)

A note on the validity of cross-validation for evaluating autoregressive time series prediction

Approximate Bayesian Forecasting

Cross-validation in nonparametric regression with outliers

On Neighbourhood Cross Validation

Cross validation for uncertain autoregressive model

Cross-validation: what does it estimate and how well does it do it?

Nested leave-two-out cross-validation for the optimal crop yield model selection

Bayesian cross-validation by parallel Markov chain Monte Carlo

Bayesian cross-validation of geostatistical models

Optimal model averaging based on forward-validation

Approximate Bayesian Computation for a Class of Time Series Models

Cross-validation in high-dimensional spaces: a lifeline for least-squares models and multi-class LDA