Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers (extended version)

Arnab Auddy, Haolin Zou, Kamiar Rahnama Rad, Arian Maleki
2023-10-27
Abstract: The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding approach to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO's error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error |ALO - LO| in terms of intuitive metrics such as the size of leave-i-out perturbations in active sets, sample size n, number of features p and regularization parameters. As a consequence, for the $\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes to infinity while n/p and SNR are fixed and bounded.
Methodology, Statistics Theory, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to estimate the out-of-sample error (OO) efficiently and accurately in high-dimensional regression problems. Specifically, it studies how closely approximate leave-one-out cross validation (ALO) tracks exact leave-one-out cross validation (LO) for generalized linear models with non-differentiable regularizers.

### Background and Motivation

1. **Out-of-Sample Error (OO)**: OO is the main quantity of interest in risk estimation and model selection.
2. **Leave-One-Out Cross Validation (LO)**: LO is a (nearly) distribution-free estimate of OO, but it requires refitting the model once per observation, which is computationally demanding in high-dimensional problems.
3. **Approximate Leave-One-Out Cross Validation (ALO)**: ALO is a computationally efficient estimate of LO (and OO). Earlier theory established its reliability for differentiable regularizers. For non-differentiable regularizers, substantial empirical evidence supports ALO, but a theoretical account of its error has been missing.

### Main Contributions of the Paper

1. **Theoretical Framework**: The paper presents a new theory that applies to a wide class of problems in the generalized linear model family with non-differentiable regularizers.
2. **Error Bound**: The paper bounds \( |ALO - LO| \) in terms of intuitive quantities: the size of the leave-i-out perturbations of the active set, the sample size \( n \), the number of features \( p \), and the regularization parameters.
3. **Specific Results**: For \( \ell_1 \)-regularized problems, the paper shows that \( |ALO - LO| \) converges to 0 as \( p \to \infty \) while \( n/p \) and the signal-to-noise ratio (SNR) stay fixed and bounded.

### Major Technical Contributions

1. **Smooth Approximation**: The paper introduces a smooth function \( r_\alpha(z) \) that approximates the \( \ell_1 \) norm \( \|z\|_1 \), which simplifies the derivation of the error bound.
2. **New Techniques**: The paper develops new techniques for relating the two estimators, in particular for controlling how they differ when computed on the same sample.

### Conclusion

The paper closes a gap in the theory of ALO for non-differentiable regularizers in high-dimensional regression, and in doing so gives a theoretical justification for using ALO in place of LO in practice.
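
To make the two quantities being compared concrete, here is a minimal sketch that computes LO and ALO for a small lasso problem. It is not code from the paper. It assumes squared loss with the objective \( \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \) and uses the commonly cited ALO shortcut for that case, which rescales each training residual by \( 1/(1 - H_{ii}) \), where \( H \) is the projection onto the columns in the active set. The helper names (`lasso_cd`, `exact_lo`, `alo`), the simulated data, and the value of `lam` are all hypothetical choices for illustration.

```python
import numpy as np

def soft_threshold(a, t):
    """Scalar soft-thresholding operator, the proximal map of t*|.|."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=2000, tol=1e-10):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()  # running residual y - X b
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = b[j]
            rho = X[:, j] @ r + col_sq[j] * old  # correlation with partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            if b[j] != old:
                r -= X[:, j] * (b[j] - old)
                max_change = max(max_change, abs(b[j] - old))
        if max_change < tol:
            break
    return b

def exact_lo(X, y, lam):
    """Exact leave-one-out squared error: refit once per held-out row."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = lasso_cd(X[keep], y[keep], lam)
        errs[i] = (y[i] - X[i] @ b_i) ** 2
    return errs.mean()

def alo(X, y, lam):
    """ALO for squared loss + l1 penalty: a single fit plus a leverage correction."""
    b = lasso_cd(X, y, lam)
    active = np.flatnonzero(b)
    resid = y - X @ b
    if active.size == 0:
        return np.mean(resid ** 2)  # empty active set: no correction
    XA = X[:, active]
    G = np.linalg.solve(XA.T @ XA, XA.T)   # (X_A^T X_A)^{-1} X_A^T
    h = np.einsum("ij,ji->i", XA, G)       # diagonal of the hat matrix on the active set
    return np.mean((resid / (1.0 - h)) ** 2)

# Simulated data (assumed setup): n/p = 2, sparse signal, Gaussian noise.
rng = np.random.default_rng(0)
n, p, lam = 40, 20, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_true = np.zeros(p)
beta_true[:4] = 3.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam)
train_mse = np.mean((y - X @ beta_hat) ** 2)
lo_val = exact_lo(X, y, lam)
alo_val = alo(X, y, lam)
print(f"train MSE = {train_mse:.4f}, LO = {lo_val:.4f}, ALO = {alo_val:.4f}")
```

The point of the comparison is cost: `exact_lo` solves \( n \) lasso problems, while `alo` solves one and then does a single linear solve on the active set. The gap between the two printed values is an empirical instance of the \( |ALO - LO| \) quantity that the paper bounds. A single small simulation like this one says nothing about the asymptotic rate.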