Cross-Validation with Antithetic Gaussian Randomization

Sifan Liu, Snigdha Panigrahi, Jake A. Soloff
2024-12-19
Abstract: We introduce a method for performing cross-validation without sample splitting. The method is well-suited for problems where traditional sample splitting is infeasible, such as when data are not assumed to be independently and identically distributed. Even in scenarios where sample splitting is possible, our method offers a computationally efficient alternative for estimating prediction error, achieving comparable or even lower error than standard cross-validation at a significantly reduced computational cost. Our approach constructs train-test data pairs using externally generated Gaussian randomization variables, drawing inspiration from recent randomization techniques such as data-fission and data-thinning. The key innovation lies in a carefully designed correlation structure among these randomization variables, referred to as antithetic Gaussian randomization. This correlation is crucial in maintaining a bounded variance while allowing the bias to vanish, offering an additional advantage over standard cross-validation, whose performance depends heavily on the bias-variance tradeoff dictated by the number of folds. We provide a theoretical analysis of the mean squared error of the proposed estimator, proving that as the level of randomization decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions. This analysis highlights the benefits of the antithetic Gaussian randomization over independent randomization. Simulation studies corroborate our theoretical findings, illustrating the robust performance of our cross-validated estimator across various data types and loss functions.
Methodology, Statistics Theory
What problem does this paper attempt to address?
The paper addresses settings where traditional cross-validation is inapplicable or inefficient. Standard cross-validation relies on sample splitting, which can perform poorly or be infeasible in the following situations:

1. **Non-independent and identically distributed data**: When the data are not independent and identically distributed (i.i.d.), as with time-series or spatially correlated data, sample splitting may destroy the inherent structure of the data.
2. **Imbalanced-class data**: In classification problems, sample splitting may leave some classes entirely missing from some folds, biasing model training and evaluation.
3. **Fixed-design regression**: In fixed-design regression, subsamples may not fully represent the entire data set.

To address these problems, the paper proposes cross-validation using antithetic Gaussian randomization. The method creates train-test pairs by adding externally generated Gaussian random variables to the data, so no sample splitting is needed. It applies to data types that are difficult to handle with standard cross-validation and is also more computationally efficient.

### Key innovation points of the method

1. **No sample splitting required**: Train-test pairs are created by adding externally generated Gaussian random variables, avoiding the traditional sample-splitting step.
2. **Controllable bias and variance**: Two user-specified parameters, α and K, control bias and variance respectively. α sets the noise level in the training data and thus affects the bias; K sets the number of repetitions and thus affects the variance.
3. **Stable variance**: The variance of the new method remains stable even when α is close to zero. Standard cross-validation, by contrast, often increases the variance when reducing the bias.
4. **Theoretical guarantee**: The paper proves that as α decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions K.

### Formula representation

The core of the method lies in how the train-test pairs are generated. For the k-th repetition, the training and testing data are:

\[
Y^{(k)}_{\text{train}} = Y+\sqrt{\alpha}\,\omega^{(k)}, \quad Y^{(k)}_{\text{test}} = Y-\frac{1}{\sqrt{\alpha}}\,\omega^{(k)}
\]

where \(\omega^{(k)}\sim N(0,\sigma^{2}I_{n})\) is a Gaussian random vector with mean zero and covariance matrix \(\sigma^{2}I_{n}\). The vectors \(\omega^{(1)},\dots,\omega^{(K)}\) are not independent: they share a specific correlation structure, called "antithetic Gaussian randomization".

### Summary

The paper provides a cross-validation method that estimates prediction error without sample splitting. By introducing external Gaussian random variables with a designed correlation structure, the method improves computational efficiency and keeps the variance stable while reducing the bias. This makes it particularly useful for non-i.i.d. data, imbalanced-class data, and fixed-design regression problems.
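
The train-test construction \(Y \pm\) scaled \(\omega^{(k)}\) described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation. The text here does not spell out the antithetic correlation structure, so the code assumes one particular choice: center K independent Gaussian draws and rescale them, which keeps each \(\omega^{(k)}\) marginally \(N(0,\sigma^{2}I_{n})\), makes them sum to zero, and gives pairwise correlation \(-1/(K-1)\). The function names `antithetic_noise` and `train_test_pairs` and all numeric values are hypothetical.

```python
import numpy as np


def antithetic_noise(K, n, sigma, rng):
    """Draw K noise vectors of length n that sum to zero.

    Assumed construction (not taken from the source text): center K
    independent standard normal draws, then rescale so that each row is
    still marginally N(0, sigma^2 I_n).
    """
    if K < 2:
        raise ValueError("need at least K = 2 repetitions")
    z = rng.standard_normal((K, n))
    centered = z - z.mean(axis=0, keepdims=True)
    # Centering shrinks each variance to (K - 1) / K; undo that here.
    return sigma * np.sqrt(K / (K - 1)) * centered


def train_test_pairs(y, alpha, omega):
    """Build the K train/test copies of y from the noise rows of omega."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    y_train = y + np.sqrt(alpha) * omega   # shape (K, n)
    y_test = y - omega / np.sqrt(alpha)    # shape (K, n)
    return y_train, y_test


# Hypothetical usage: n = 5 observations, K = 4 repetitions.
rng = np.random.default_rng(0)
y = np.array([1.0, -0.5, 2.0, 0.3, 1.7])
omega = antithetic_noise(K=4, n=y.size, sigma=1.0, rng=rng)
y_train, y_test = train_test_pairs(y, alpha=0.1, omega=omega)
```

Under this assumed construction the noise cancels across repetitions, so the average of the K training vectors equals `y` exactly. For each repetition, `y_train + alpha * y_test` equals `(1 + alpha) * y`, which follows directly from the two formulas.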