Trade-off between predictive performance and FDR control for high-dimensional Gaussian model selection

Perrine Lacroix, Marie-Laure Martin
2024-06-28
Abstract: In the context of high-dimensional Gaussian linear regression for ordered variables, we study the variable selection procedure via the minimization of the penalized least-squares criterion. We focus on model selection where the penalty function depends on an unknown multiplicative constant commonly calibrated for prediction. We propose a new proper calibration of this hyperparameter to simultaneously control predictive risk and false discovery rate. We obtain non-asymptotic bounds on the False Discovery Rate with respect to the hyperparameter and we provide an algorithm to calibrate it. This algorithm is based on quantities that can typically be observed in real data applications. The algorithm is validated in an extensive simulation study and is compared with several existing variable selection procedures. Finally, we study an extension of our approach to the case in which an ordering of the variables is not available.
Statistics Theory, Applications, Methodology
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses ordered variable selection in high-dimensional Gaussian linear regression. The authors study variable selection by minimizing a penalized least-squares criterion, focusing on a penalty function that depends on an unknown multiplicative constant, which is commonly calibrated for prediction. The main objective is to propose a new calibration of this hyperparameter that simultaneously controls the predictive risk (PR) and the false discovery rate (FDR). The authors derive non-asymptotic bounds on the FDR with respect to the hyperparameter and provide a calibration algorithm based on quantities that can be observed in real data applications.

### Key contributions

1. **Simultaneous control of FDR and PR**:
   - The authors propose a new hyperparameter calibration method that simultaneously controls the predictive risk and the false discovery rate during model selection.
   - With an appropriately calibrated penalty function, non-asymptotic control of the predictive risk can be achieved.
2. **Theoretical results**:
   - Derives non-asymptotic upper and lower bounds on the FDR, given the variable order and a known variance.
   - Proves that the FDR converges to 0 at an exponential rate as the hyperparameter \( K \) tends to infinity.
3. **Algorithm verification**:
   - Proposes an algorithm to calibrate the hyperparameter \( K \) and validates it in an extensive simulation study.
   - Compares the proposed algorithm with existing variable selection methods and reports better prediction performance and FDR control.
4. **Extension to non-ordered variable selection**:
   - Explores how to estimate the variable order with a data-driven method when no natural variable order is available, and applies the estimated order in the model selection process.
### Formulas and symbols

- **Model definition**:
  \[ Y = X\beta^* + \epsilon \]
  where \( Y\in\mathbb{R}^n \) is the response vector, \( X\in\mathbb{R}^{n\times p} \) is the design matrix, \( \beta^*\in\mathbb{R}^p \) is the true regression coefficient vector, and \( \epsilon\sim N(0,\sigma^2I_n) \) is the noise term.
- **Predictive risk**:
  \[ PR(m)=\mathbb{E}\left[\|Y - X\hat{\beta}_m\|^2_2\right] \]
- **False discovery rate**:
  \[ FDR(m)=\mathbb{E}\left[\frac{FP(m)}{\max(D_m, 1)}\right] \]
  where \( FP(m) \) is the number of variables included in model \( m \) but not in the true model \( m^* \), and \( D_m \) is the dimension of model \( m \).
- **Penalty function**:
  \[ \text{pen}(D_m)=K\sigma^2D_m \]
- **Selected model**:
  \[ \hat{m}(K)=\arg\min_{m\in M}\left\{\|Y - X\hat{\beta}_m\|^2_2+K\sigma^2D_m\right\} \]

### Conclusion

The paper proposes a new hyperparameter calibration method that simultaneously controls the predictive risk and the false discovery rate in high-dimensional Gaussian linear regression. The theoretical results and the simulation study support the effectiveness of the method. The paper also explores how to extend the method when no natural variable order is available, which broadens its range of application.
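
### Illustrative sketch

The selected-model formula above can be made concrete with a short Python sketch. This is not the paper's calibration algorithm. It only evaluates the criterion \( \|Y - X\hat{\beta}_m\|^2_2 + K\sigma^2 D_m \) for a fixed \( K \), under assumptions that are not taken from the source: the model collection \( M \) consists of the nested models built from the first \( D \) columns of \( X \), the variance \( \sigma^2 \) is known, the leading columns of \( X \) are linearly independent, and the true model is the first `d_star` variables. The names `select_ordered_model`, `false_discovery_proportion`, and `d_max` are hypothetical.

```python
import numpy as np


def select_ordered_model(X, y, K, sigma2, d_max=None):
    """Return the dimension D minimizing ||y - X_D beta_hat_D||^2 + K*sigma2*D.

    Assumption (not from the paper): candidate models are nested, model D
    uses the first D columns of X, for D = 0, ..., d_max.
    """
    n, p = X.shape
    if d_max is None:
        d_max = min(n, p)
    # Reduced QR orthonormalizes the columns in their given order, so the
    # first D columns of Q span the first D columns of X (full column rank
    # assumed). The residual sum of squares of model D is then ||y||^2 minus
    # the sum of the first D squared projections.
    Q, _ = np.linalg.qr(X[:, :d_max])
    proj2 = (Q.T @ y) ** 2
    rss = y @ y - np.concatenate(([0.0], np.cumsum(proj2)))
    crit = rss + K * sigma2 * np.arange(d_max + 1)
    return int(np.argmin(crit))


def false_discovery_proportion(d_hat, d_star):
    """FP / max(D, 1) for one selection, assuming the true model is the
    first d_star variables, so false positives are the variables past d_star."""
    return max(d_hat - d_star, 0) / max(d_hat, 1)


if __name__ == "__main__":
    # Simulated data with illustrative settings (all values are assumptions).
    rng = np.random.default_rng(0)
    n, p, d_star, sigma2 = 50, 200, 5, 1.0
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:d_star] = 3.0
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    for K in (1.0, 2.0, 4.0, 8.0):
        d_hat = select_ordered_model(X, y, K, sigma2, d_max=20)
        print(K, d_hat, false_discovery_proportion(d_hat, d_star))
```

Averaging `false_discovery_proportion` over many simulated datasets gives an empirical estimate of \( FDR(\hat{m}(K)) \). Because the penalty grows linearly in \( D_m \), the selected dimension cannot increase when \( K \) increases, which is the mechanism behind the trade-off between prediction and FDR control that the calibration of \( K \) targets.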