Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics

Wei Lin, Rui Feng, Hongzhe Li
DOI: https://doi.org/10.1080/01621459.2014.908125
2014-03-18
Abstract: In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative $L_1$ regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online.
Methodology, Applications
What problem does this paper attempt to address?
This paper addresses how to jointly analyze gene expression data and genetic variants to explore their associations with complex traits in genetical genomics studies. Specifically, it studies variable selection and estimation in high-dimensional sparse instrumental variables models. Because the numbers of gene expressions and genetic variants can both be much larger than the sample size, and the optimal instruments are unknown, the analysis is challenging. The authors therefore propose a two-stage regularization framework that identifies and estimates important covariate effects while selecting and estimating optimal instruments. The method extends the classical two-stage least squares estimator to high dimensions by applying sparsity-inducing penalty functions in both stages, and it is implemented efficiently by coordinate descent optimization. For $L_1$ regularization and a class of concave regularization methods, the paper establishes estimation, prediction, and model selection properties in the setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. Practical performance is evaluated in simulation studies, and the method's usefulness is illustrated by an analysis of mouse obesity data.
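The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the $L_1$ penalty only, a linear model y = X beta + error with first stage X = Z Gamma + error, fixed tuning parameters, and a plain cyclic coordinate descent solver. All function names, parameter names (`lam1`, `lam2`), and the simulated data are hypothetical and chosen for illustration.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the L1 coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(A, b, lam, n_iter=100):
    """Minimize (1/(2n)) * ||b - A w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(A), len(A[0])
    w = [0.0] * p
    resid = list(b)  # residual b - A w, with w = 0 initially
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (coordinate j added back).
            rho = sum(A[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * A[i][j]
                w[j] = new
    return w

def two_stage_lasso(y, X, Z, lam1, lam2):
    """Stage 1: Lasso of each covariate on the instruments Z.
    Stage 2: Lasso of y on the first-stage fitted covariates."""
    n, p, q = len(X), len(X[0]), len(Z[0])
    # gamma[j] holds the q instrument coefficients for covariate j.
    gamma = [lasso_cd(Z, [X[i][j] for i in range(n)], lam1) for j in range(p)]
    X_hat = [[sum(Z[i][k] * gamma[j][k] for k in range(q)) for j in range(p)]
             for i in range(n)]
    beta = lasso_cd(X_hat, y, lam2)
    return beta, gamma

def simulate(n=300, p=5, q=10, seed=0):
    """Hypothetical toy data: covariate j is driven by instrument j, and the
    outcome error shares a component with the error of covariate 0 (endogeneity)."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(n)]
    E = [[rng.gauss(0, 0.5) for _ in range(p)] for _ in range(n)]
    X = [[Z[i][j] + E[i][j] for j in range(p)] for i in range(n)]
    y = [2.0 * X[i][0] - 1.5 * X[i][1] + 0.8 * E[i][0] + rng.gauss(0, 0.5)
         for i in range(n)]
    return y, X, Z

if __name__ == "__main__":
    y, X, Z = simulate()
    beta, gamma = two_stage_lasso(y, X, Z, lam1=0.05, lam2=0.05)
    print("second-stage coefficients:", [round(b, 3) for b in beta])
```

In this sketch the penalties are fixed by hand; in practice they would be chosen by a data-driven rule such as cross-validation, and the concave penalties mentioned in the abstract would replace the soft-thresholding step with a different coordinate update.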