Abstract: We introduce a method for performing cross-validation without sample splitting. The method is well-suited for problems where traditional sample splitting is infeasible, such as when data are not assumed to be independently and identically distributed. Even in scenarios where sample splitting is possible, our method offers a computationally efficient alternative for estimating prediction error, achieving comparable or even lower error than standard cross-validation at a significantly reduced computational cost. Our approach constructs train-test data pairs using externally generated Gaussian randomization variables, drawing inspiration from recent randomization techniques such as data-fission and data-thinning. The key innovation lies in a carefully designed correlation structure among these randomization variables, referred to as antithetic Gaussian randomization. This correlation is crucial in maintaining a bounded variance while allowing the bias to vanish, offering an additional advantage over standard cross-validation, whose performance depends heavily on the bias-variance tradeoff dictated by the number of folds. We provide a theoretical analysis of the mean squared error of the proposed estimator, proving that as the level of randomization decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions. This analysis highlights the benefits of the antithetic Gaussian randomization over independent randomization. Simulation studies corroborate our theoretical findings, illustrating the robust performance of our cross-validated estimator across various data types and loss functions.

What problem does this paper attempt to address?

The problem this paper attempts to solve is that traditional cross-validation methods are inapplicable or inefficient for certain data types. Specifically, standard cross-validation relies on sample splitting, which may perform poorly or be infeasible in the following situations:

1. **Non-independent and identically distributed data**: When data are not independent and identically distributed (i.i.d.), as with time-series or spatially correlated data, sample splitting may destroy the inherent structure of the data.
2. **Imbalanced-class data**: For classification problems, sample splitting may leave some classes entirely absent from some folds, biasing model training and evaluation.
3. **Fixed-design regression**: In fixed-design regression, subsamples may not fully represent the entire data set.

To solve these problems, the paper proposes a new cross-validation method, namely cross-validation using antithetic Gaussian randomization. The method creates train-test pairs by adding externally generated Gaussian random variables to the data, without the need for sample splitting. It applies to data types that are difficult to handle with standard cross-validation and is also computationally more efficient.

### Key innovation points of the method

1. **No sample splitting required**: The new method creates train-test pairs by adding externally generated Gaussian random variables, avoiding the traditional sample-splitting step.
2. **Controllable bias and variance**: Two user-specified parameters, α and K, control bias and variance respectively. α sets the noise level in the training data and thus affects the bias; K sets the number of repetitions and thus affects the variance.
3. **Stable variance**: Even when α is close to zero, the variance of the new method remains stable. This differs from standard cross-validation, where reducing the bias often increases the variance.
4. **Theoretical guarantee**: The paper provides a theoretical analysis, proving that as α decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions K.

### Formula representation

The core of the new method lies in how the train-test pairs are generated. For the k-th repetition, the training and testing data can be written as:

\[ Y^{(k)}_{\text{train}} = Y+\sqrt{\alpha}\,\omega^{(k)}, \quad Y^{(k)}_{\text{test}} = Y-\frac{1}{\sqrt{\alpha}}\,\omega^{(k)} \]

where \(\omega^{(k)}\sim N(0,\sigma^{2}I_{n})\) is a Gaussian random vector with mean zero and covariance matrix \(\sigma^{2}I_{n}\). The vectors \(\omega^{(1)},\dots,\omega^{(K)}\) are not independent of one another: they share a specific correlation structure, called "antithetic Gaussian randomization".

### Summary

This paper aims to provide a new cross-validation method that estimates the prediction error effectively without sample splitting. By introducing external Gaussian random variables and designing their correlation structure, the method improves computational efficiency and keeps the variance stable while reducing the bias. This matters for non-i.i.d. data, imbalanced-class data, and fixed-design regression problems.
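The train-test construction given under "Formula representation" can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The summary above does not spell out the correlation structure, so the sketch assumes the equicorrelated form Cov(ω^(j), ω^(k)) = −σ²/(K−1)·I_n for j ≠ k, obtained by centering and rescaling i.i.d. Gaussian draws. The function names `antithetic_gaussian` and `train_test_pairs` and the placeholder values of `n`, `K`, `sigma`, and `alpha` are hypothetical.

```python
import numpy as np


def antithetic_gaussian(n, K, sigma, rng):
    """Draw K vectors omega^(k) in R^n, each marginally N(0, sigma^2 I_n).

    Assumed correlation structure (not stated in the text above):
    Cov(omega^(j), omega^(k)) = -sigma^2 / (K - 1) * I_n for j != k,
    which forces the K vectors to sum to zero.
    """
    if K < 2:
        raise ValueError("K must be at least 2 for antithetic draws")
    z = rng.normal(scale=sigma, size=(K, n))      # i.i.d. N(0, sigma^2) entries
    centered = z - z.mean(axis=0, keepdims=True)  # the K rows now sum to zero
    return np.sqrt(K / (K - 1)) * centered        # restore marginal variance sigma^2


def train_test_pairs(y, alpha, omega):
    """Build Y_train^(k) and Y_test^(k) for every row omega^(k) of omega."""
    y_train = y + np.sqrt(alpha) * omega
    y_test = y - omega / np.sqrt(alpha)
    return y_train, y_test


rng = np.random.default_rng(0)
n, K, sigma, alpha = 5, 4, 1.0, 0.1   # illustrative values only
y = rng.normal(size=n)                # placeholder response vector
omega = antithetic_gaussian(n, K, sigma, rng)
y_train, y_test = train_test_pairs(y, alpha, omega)
```

Under the assumed structure the randomization vectors cancel across repetitions, so averaging the K training vectors (or the K test vectors) recovers `y` exactly. Independent draws would not have this property, which gives some intuition for why the antithetic design can keep the variance bounded as α shrinks.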

Cross-Validation with Antithetic Gaussian Randomization
